From 5572aa925230a8d3ca6b77753ecae1cf31d3f21c Mon Sep 17 00:00:00 2001
From: Andrew Chen
Date: Mon, 1 Jun 2026 03:41:53 -0700
Subject: [PATCH 01/13] fix+optimize: gpu-operator-install and gpu-operator-references skills
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Applies verified #401 review findings and the doc-optimize-skill quality pass to the two highest-blast-radius skills.

gpu-operator-install (renamed from the converter typo gpu-operator-install-ing-nvidia, finding #6/#1):
- Restored the 4 silently-dropped prerequisites from getting-started.rst (ClusterPolicy/OS constraints, container engine, PSA labeling, NFD detection), finding #1.
- Inlined the dropped tf-notebook.yaml literalinclude, finding #2.
- Fixed 5 broken :external+ Sphinx roles to real OpenShift/CTK/Edge URLs, finding #7.
- Mapped lost :ref:/:doc: cross-refs to (use the X skill) or published doc links, finding #10.
- Fixed nvaie-tanzu_ RST link leak in the platforms table.
- Dropped the Trigger keywords suffix; added triggers:/tags: arrays, finding #13.
- Removed Step N: H2 prefixes, finding #14.
- Converted flat-bold admonitions to GitHub alerts, finding #15.
- Replaced 11 ${version} leaks with v26.3.1.

gpu-operator-references:
- Rewrote the mis-routed description (claimed confidential-containers only; loads 7 references), finding #5; dropped Trigger suffix, added triggers:/tags:.
- overview.md: fixed missing image asset (#9), 16 pstai_ link leaks (#8), lost cross-refs (#10), :external+ocpindex role (#7), identifieis typo (#12).
- security.md: removed duplicated phrase (#11).
- life-cycle-policy.md / platform-support.md / release-notes.md: converted flat-bold admonitions to GitHub alerts (#15) and repaired admonition-boundary over-captures + an ocp_csp_support substitution leak surfaced during conversion.

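The leak patterns named above can be re-checked with a small grep wrapper. This is only a sketch and is not shipped in this patch: the function name `check_conversion_leaks` is hypothetical, and the pattern list is an assumption drawn from the findings in this message (`${version}`, `:external+` roles, `pstai_`, `nvaie-tanzu_`), not from any repository tooling.

```shell
#!/usr/bin/env bash
# Sketch only: the pattern list is an assumption taken from the findings in
# this commit message, not from the converter or any repository tooling.
# Prints every matching line as file:line:text and returns non-zero when a
# leak pattern is still present under the given files or directories.
check_conversion_leaks() {
  # Literal ${version}, Sphinx :external+ roles, and the two RST link leaks.
  local pattern='\$\{version\}|:external\+|pstai_|nvaie-tanzu_'
  if grep -rnE "$pattern" "$@"; then
    return 1
  fi
  return 0
}
```

Run from the repository root, a call such as `check_conversion_leaks gpu-operator/.agents/skills/gpu-operator-install gpu-operator/.agents/skills/gpu-operator-references` would list any remaining matches. Treat hits as candidates for review rather than definite errors, since a reference page could mention one of these strings on purpose.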
Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../SKILL.md | 185 +++++++++++++----- .../skills/gpu-operator-references/SKILL.md | 18 +- .../references/life-cycle-policy.md | 23 +-- .../references/overview.md | 47 ++--- .../references/platform-support.md | 74 ++++--- .../references/release-notes.md | 60 +++--- .../references/security.md | 2 +- 7 files changed, 244 insertions(+), 165 deletions(-) rename gpu-operator/.agents/skills/{gpu-operator-install-ing-nvidia => gpu-operator-install}/SKILL.md (76%) diff --git a/gpu-operator/.agents/skills/gpu-operator-install-ing-nvidia/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install/SKILL.md similarity index 76% rename from gpu-operator/.agents/skills/gpu-operator-install-ing-nvidia/SKILL.md rename to gpu-operator/.agents/skills/gpu-operator-install/SKILL.md index 34f24953f..785d6ab6e 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-ing-nvidia/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install/SKILL.md @@ -1,30 +1,74 @@ --- -name: "gpu-operator-install-ing-nvidia" -description: "Installs the NVIDIA GPU Operator in a Kubernetes cluster with Helm. Use when users are getting started, installing the Operator for the first time, or checking installation prerequisites. Trigger keywords - NVIDIA GPU Operator, installation, Helm, Kubernetes, getting started." +name: "gpu-operator-install" +description: "Installs the NVIDIA GPU Operator in a Kubernetes cluster with Helm. Use when users are getting started, installing the Operator for the first time, or checking installation prerequisites." +triggers: + - NVIDIA GPU Operator + - installation + - Helm + - Kubernetes + - getting started +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - installation + - helm + - getting-started --- -# Prerequisites +# Installing the NVIDIA GPU Operator + +The current patch release of this version of the NVIDIA GPU Operator is `v26.3.1`. 
+ +> [!TIP] +> For installation on Red Hat OpenShift Container Platform, refer to [OpenShift installation steps](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html). + +## Prerequisites 1. You have the `kubectl` and `helm` CLIs available on a client machine. -# Installing the NVIDIA GPU Operator + You can run the following commands to install the Helm CLI: + + ```console + $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \ + && chmod 700 get_helm.sh \ + && ./get_helm.sh + ``` + +1. If you are planning to use ClusterPolicy for driver configuration, all worker nodes or node groups to run GPU workloads in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. + Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems. -**Version:** + For worker nodes or node groups that run CPU workloads only, the nodes can run any operating system because the GPU Operator does not perform any configuration or management of nodes for CPU-only workloads. -The current patch release of this version of the NVIDIA GPU Operator is `${version}`. -**Red Hat OpenShift Container Platform Install:** + If you are planning to use the NVIDIA GPU Driver Custom Resource Definition, you can use a mix of operating system versions on CPU and GPU nodes. Refer to the NVIDIA GPU Driver Custom Resource Definition (use the `gpu-operator-nvidia-driver` skill) page for more information. -For installation on Red Hat OpenShift Container Platform, refer to :external+ocpsteps-overview. +1. Nodes must be configured with a container engine such as CRI-O or containerd. + +1. 
If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods, label the namespace for the Operator to set the enforcement policy to privileged: + + ```console + $ kubectl create ns gpu-operator + $ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged + ``` + +1. Node Feature Discovery (NFD) is a dependency for the Operator on each node. + By default, NFD master and worker are automatically deployed by the Operator. + If NFD is already running in the cluster, then you must disable deploying NFD when you install the Operator. + + One way to determine if NFD is already running in the cluster is to check for an NFD label on your nodes: + + ```console + $ kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))' + ``` -## Step 1: Procedure + If the command output is `true`, then NFD is already running in the cluster. -**Tip:** +## Procedure -For installation on Red Hat OpenShift Container Platform, -refer to :external+ocpsteps-overview. 1. Add the NVIDIA Helm repository: ```console @@ -40,7 +84,7 @@ refer to :external+ocpsteps-overview. $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} + --version=v26.3.1 ``` - Install the Operator and specify configuration options: @@ -49,14 +93,13 @@ refer to :external+ocpsteps-overview. $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set = ``` - Refer to the gpu-operator-helm-chart-options - and common deployment scenarios for more information. + Refer to the **Common Chart Customization Options** and **Common Deployment Scenarios** sections below for more information. -## Step 2: Common Chart Customization Options +## Common Chart Customization Options The following options are available when using the Helm chart. 
These options can be used with `--set` when installing with Helm. @@ -67,29 +110,29 @@ To view all the options, run `helm show values nvidia/gpu-operator`. | Parameter | Description | Default | | --- | --- | --- | | `ccManager.enabled` | When set to `true`, the Operator deploys NVIDIA Confidential Computing Manager for Kubernetes. | `false` | -| `cdi.enabled` | When set to `true` (default), the Container Device Interface (CDI) will be used for injecting GPUs into workload containers. The Operator will no longer configure the `nvidia` runtime class as the default runtime handler. Instead, native-CDI support in container runtimes like containerd or cri-o will be leveraged for injecting GPUs into workload containers. Refer to the cdi page for more information. | `true` | -| `cdi.nriPluginEnabled` | When set to `true`, the Node Resource Interface (NRI) Plugin will be used for injecting GPUs into workload containers. In NRI Plugin mode, the NVIDIA Container Toolkit will no longer modify the runtime config. This feature requires containerd v1.7.30, v2.1.x, or v2.2.x. Refer to the cdi page for more information. | `false` | +| `cdi.enabled` | When set to `true` (default), the Container Device Interface (CDI) will be used for injecting GPUs into workload containers. The Operator will no longer configure the `nvidia` runtime class as the default runtime handler. Instead, native-CDI support in container runtimes like containerd or cri-o will be leveraged for injecting GPUs into workload containers. Refer to the Container Device Interface page (use the `gpu-operator-container-device` skill) for more information. | `true` | +| `cdi.nriPluginEnabled` | When set to `true`, the Node Resource Interface (NRI) Plugin will be used for injecting GPUs into workload containers. In NRI Plugin mode, the NVIDIA Container Toolkit will no longer modify the runtime config. This feature requires containerd v1.7.30, v2.1.x, or v2.2.x. 
Refer to the Container Device Interface page (use the `gpu-operator-container-device` skill) for more information. | `false` | | `cdi.default` Deprecated. | This field is deprecated as of v25.10.0 and will be ignored. The `cdi.enabled` field is set to `true` by default in versions 25.10.0 and later. When set to `true`, the container runtime uses CDI to perform device injection by default. | `false` | | `daemonsets.annotations` | Map of custom annotations to add to all GPU Operator managed pods. | `{}` | | `daemonsets.labels` | Map of custom labels to add to all GPU Operator managed pods. | `{}` | | `dcgmExporter.enabled` | By default, the Operator gathers GPU telemetry in Kubernetes using [DCGM Exporter](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html). Set this value to `false` to disable it. Available values are `true` (default) or `false`. | `true` | | `dcgmExporter.service.internalTrafficPolicy` | Specifies the [internalTrafficPolicy](https://kubernetes.io/docs/concepts/services-networking/service/#traffic-policies) for the DCGM Exporter service. Available values are `Cluster` (default) or `Local`. | `Cluster` | | `dcgmExporter.hostNetwork` | When set to `true`, the DCGM Exporter will expose a metric port on the host's network namespace. | `false` | -| `devicePlugin.config` | Specifies the configuration for the NVIDIA Device Plugin as a config map. In most cases, this field is configured after installing the Operator, such as to configure gpu-sharing. | `{}` | +| `devicePlugin.config` | Specifies the configuration for the NVIDIA Device Plugin as a config map. In most cases, this field is configured after installing the Operator, such as to configure GPU time-slicing (use the `gpu-operator-timeslicing-gpus` skill). | `{}` | | `driver.enabled` | By default, the Operator deploys NVIDIA drivers as a container on the system. Set this value to `false` when using the Operator on systems with pre-installed drivers. 
| `true` | | `driver.image` | Name of the NVIDIA Driver Container image to use. | `driver` | | `driver.imagePullSecrets` | List of the image pull secret used for pulling the driver container image from the registry. | None | | `driver.kernelModuleType` | Specifies the type of the NVIDIA GPU Kernel modules to use. Valid values are `auto` (default), `proprietary`, and `open`. `Auto` means that the recommended kernel module type (open or proprietary) is chosen based on the GPU devices on the host and the driver branch used. The `auto` option is only supported with the 570.86.15 and 570.124.06 or later driver containers. 550 and 535 branch drivers do not yet support this mode. `Open` means the open kernel module is used. `Proprietary` means the proprietary module is used. | `auto` | | `driver.nvidiaDriverCRD.enabled` | When set to `true`, the Operator deploys NVIDIA GPU Driver Custom Resource Definition. Refer to the NVIDIA GPU Driver Custom Resource Definition (use the `gpu-operator-nvidia-driver` skill) page for more information. | `false` | | `driver.repository` | The images are downloaded from NGC. Specify another image repository when using custom driver images. | `nvcr.io/nvidia` | -| `driver.rdma.enabled` | Controls whether the driver daemon set builds and loads the legacy `nvidia-peermem` kernel module. You might be able to use GPUDirect RDMA without enabling this option. Refer to gpu-operator-rdma for information about whether you can use DMA-BUF or you need to use legacy `nvidia-peermem`. | `false` | +| `driver.rdma.enabled` | Controls whether the driver daemon set builds and loads the legacy `nvidia-peermem` kernel module. You might be able to use GPUDirect RDMA without enabling this option. Refer to the GPUDirect RDMA page (use the `gpu-operator-gpudirect-rdma` skill) for information about whether you can use DMA-BUF or you need to use legacy `nvidia-peermem`. 
| `false` | | `driver.rdma.useHostMofed` | Indicate if MLNX_OFED (MOFED) drivers are pre-installed on the host. | `false` | -| `driver.secretEnv` | The name of the secret to the driver container. A common use case is to use this field to pass your Ubuntu Pro token secret if you are deploying the GPU Operator with government-ready components. Refer to install-gpu-operator-gov-ready for more information. | None | +| `driver.secretEnv` | The name of the secret to the driver container. A common use case is to use this field to pass your Ubuntu Pro token secret if you are deploying the GPU Operator with government-ready components. Refer to the government-ready installation page (use the `gpu-operator-install-governmentready-environments` skill) for more information. | None | | `driver.startupProbe` | By default, the driver container has an initial delay of `60s` before starting liveness probes. The probe runs the `nvidia-smi` command with a timeout duration of `60s`. You can increase the `timeoutSeconds` duration if the `nvidia-smi` command runs slowly in your cluster. | `60s` | | `driver.useOpenKernelModules` Deprecated. | This field is deprecated as of v25.3.0 and will be ignored. Use `kernelModuleType` instead. When set to `true`, the driver containers install the NVIDIA Open GPU Kernel module driver. | `false` | | `driver.usePrecompiled` | When set to `true`, the Operator attempts to deploy driver containers that have precompiled kernel drivers. Refer to the precompiled driver containers (use the `gpu-operator-precompiled-drivers` skill) page for the supported operating systems. | `false` | -| `driver.version` | Version of the NVIDIA datacenter driver supported by the Operator. If you set `driver.usePrecompiled` to `true`, then set this field to a driver branch, such as `525`. | Depends on the version of the Operator. Refer to the GPU Operator Component Matrix for more information on supported drivers. | -| `gdrcopy.enabled` | Enables support for GDRCopy. 
When set to `true`, the GDRCopy Driver runs as a sidecar container in the GPU driver pod. For information about GDRCopy, refer to the [gdrcopy](https://developer.nvidia.com/gdrcopy) page. You can enable GDRCopy if you use the gpu-driver-configuration. | `false` | +| `driver.version` | Version of the NVIDIA datacenter driver supported by the Operator. If you set `driver.usePrecompiled` to `true`, then set this field to a driver branch, such as `525`. | Depends on the version of the Operator. Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) for more information on supported drivers. | +| `gdrcopy.enabled` | Enables support for GDRCopy. When set to `true`, the GDRCopy Driver runs as a sidecar container in the GPU driver pod. For information about GDRCopy, refer to the [gdrcopy](https://developer.nvidia.com/gdrcopy) page. You can enable GDRCopy if you use the NVIDIA GPU Driver custom resource (use the `gpu-operator-nvidia-driver` skill). | `false` | | `mig.strategy` | Controls the strategy to be used with MIG on supported NVIDIA GPUs. Options are either `mixed` or `single`. | `single` | | `migManager.enabled` | The MIG manager watches for changes to the MIG geometry and applies reconfiguration as needed. By default, the MIG manager only runs on nodes with GPUs that support MIG (such as the A100). | `true` | | `nfd.enabled` | Deploys Node Feature Discovery plugin as a daemonset. Set this variable to `false` if NFD is already running in the cluster. | `true` | @@ -101,7 +144,7 @@ To view all the options, run `helm show values nvidia/gpu-operator`. | `sandboxWorkloads.mode` | Specifies the sandbox mode to use when deploying sandbox workloads. Accepted values are `kubevirt` (default) and `kata`. 
Refer to the KubeVirt (use the `gpu-operator-kubevirt` skill) or the Kata Containers (use the `gpu-operator-kata-containers` skill) pages for more information on using KubeVirt or Kata based workloads. | `kubevirt` | | `toolkit.enabled` | By default, the Operator deploys the NVIDIA Container Toolkit (`nvidia-docker2` stack) as a container on the system. Set this value to `false` when using the Operator on systems with pre-installed NVIDIA runtimes. | `true` | -## Step 3: Common Deployment Scenarios +## Common Deployment Scenarios The following common deployment scenarios and sample commands apply best to bare metal hosts or virtual machines with GPU passthrough. @@ -116,7 +159,7 @@ For example, to install the GPU Operator in the `nvidia-gpu-operator` namespace: $ helm install --wait --generate-name \ -n nvidia-gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 ``` If you do not specify a namespace during installation, all GPU Operator components are installed in the `default` namespace. @@ -150,13 +193,13 @@ In this scenario, use the NVIDIA Container Toolkit image that is built on UBI 8: $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set toolkit.version=v1.16.1-ubi8 ``` Replace the `v1.16.1` value in the preceding command with the version that is supported with the NVIDIA GPU Operator. -Refer to the GPU Operator Component Matrix on the platform support page. +Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) on the life cycle policy page. When using RHEL8 with Kubernetes, SELinux must be enabled either in permissive or enforcing mode for use with the GPU Operator.
Additionally, when using RHEL8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, by setting the `enable_selinux=true` configuration option. @@ -170,7 +213,7 @@ In this scenario, the NVIDIA GPU driver is already installed on the worker nodes $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set driver.enabled=false ``` @@ -184,11 +227,10 @@ The Operator proceeds to start other pods, such as the container toolkit pod. In this scenario, the NVIDIA GPU driver and the NVIDIA Container Toolkit are already installed on the worker nodes that have GPUs. -**Tip:** - -This scenario applies to NVIDIA DGX Systems that run NVIDIA Base OS. -Before installing the Operator, ensure that the default runtime is set to `nvidia`. -Refer to :external+ctkconfiguration in the NVIDIA Container Toolkit documentation for more information. +> [!TIP] +> This scenario applies to NVIDIA DGX Systems that run NVIDIA Base OS. +> Before installing the Operator, ensure that the default runtime is set to `nvidia`. +> Refer to the [NVIDIA Container Toolkit configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) documentation for more information. 
Install the Operator with the following options: @@ -196,7 +238,7 @@ Install the Operator with the following options: $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set driver.enabled=false \ --set toolkit.enabled=false ``` @@ -217,7 +259,7 @@ In this scenario, the NVIDIA Container Toolkit is already installed on the worke $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set toolkit.enabled=false ``` @@ -245,7 +287,7 @@ you can build a custom driver container image. Follow these steps: $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set driver.repository=docker.io/nvidia \ --set driver.version="465.27" ``` @@ -254,15 +296,14 @@ These instructions are provided for reference and evaluation purposes. Not using the standard releases of the GPU Operator from NVIDIA would mean limited support for such custom configurations. -## Step 4: Specifying Configuration Options for containerd - -**Note:** +## Specifying Configuration Options for containerd -It's recommended that you enable the NRI Plugin to configure the container runtime by setting `cdi.nriPluginEnabled=true`. -When enabled, you do not need to specify the `toolkit.env` options and injecting GPUs into workload containers is handled by the NRI Plugin. -Refer to the NRI Plugin documentation, for more information. -When you use containerd as the container runtime, the following configuration -options are used with the container-toolkit deployed with GPU Operator: +> [!NOTE] +> It's recommended that you enable the NRI Plugin to configure the container runtime by setting `cdi.nriPluginEnabled=true`. 
+> When enabled, you do not need to specify the `toolkit.env` options, and injecting GPUs into workload containers is handled by the NRI Plugin. +> Refer to the Container Device Interface and NRI page (use the `gpu-operator-container-device` skill) for more information. + +When you use containerd as the container runtime, the following configuration options are used with the container-toolkit deployed with GPU Operator: ```yaml toolkit: @@ -280,7 +321,7 @@ If you need to specify custom values, refer to the following sample command for ```console helm install gpu-operator -n gpu-operator --create-namespace \ nvidia/gpu-operator $HELM_OPTIONS \ - --version=${version} \ + --version=v26.3.1 \ --set toolkit.env[0].name=CONTAINERD_CONFIG \ --set toolkit.env[0].value=/etc/containerd/containerd.toml \ --set toolkit.env[1].name=CONTAINERD_SOCKET \ @@ -328,7 +369,7 @@ in the RKE2 documentation. It's recommended that you enable CDI (default) and the NRI Plugin on RKE. With both features enabled, you do not need to set `runtimeClassName: nvidia` in your pod spec. -Refer to the v24.9.0-known-limitations. +Refer to the [v24.9.0 known limitations](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/release-notes.html) in the release notes. ### MicroK8s @@ -350,7 +391,7 @@ These options can be passed to GPU Operator during install time as below.
```console helm install gpu-operator -n gpu-operator --create-namespace \ nvidia/gpu-operator $HELM_OPTIONS \ - --version=${version} \ + --version=v26.3.1 \ --set toolkit.env[0].name=CONTAINERD_CONFIG \ --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \ --set toolkit.env[1].name=CONTAINERD_SOCKET \ @@ -359,7 +400,7 @@ helm install gpu-operator -n gpu-operator --create-namespace \ --set-string toolkit.env[2].value=file=/var/snap/microk8s/current/args/containerd.toml ``` -## Step 5: Verification: Running Sample GPU Applications +## Verification: Running Sample GPU Applications ### CUDA VectorAdd @@ -425,6 +466,44 @@ You can perform the following steps to deploy Jupyter Notebook in your cluster: 1. Create a file, such as `tf-notebook.yaml`, with contents like the following example: + ```yaml + --- + apiVersion: v1 + kind: Service + metadata: + name: tf-notebook + labels: + app: tf-notebook + spec: + type: NodePort + ports: + - port: 80 + name: http + targetPort: 8888 + nodePort: 30001 + selector: + app: tf-notebook + --- + apiVersion: v1 + kind: Pod + metadata: + name: tf-notebook + labels: + app: tf-notebook + spec: + securityContext: + fsGroup: 0 + containers: + - name: tf-notebook + image: tensorflow/tensorflow:latest-gpu-jupyter + resources: + limits: + nvidia.com/gpu: 1 + ports: + - containerPort: 8888 + name: notebook + ``` + 1. Apply the manifest to deploy the pod and start the service: ```console @@ -484,10 +563,10 @@ You can perform the following steps to deploy Jupyter Notebook in your cluster: The notebook should now be accessible from your browser at this URL: [http://your-machine-ip:30001/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9](http://your-machine-ip:30001/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9). 
-## Step 6: Installation on Commercially Supported Kubernetes Platforms +## Installation on Commercially Supported Kubernetes Platforms | Product | Documentation | | --- | --- | -| Red Hat OpenShift 4 using RHCOS worker nodes | :external+ocpindex | -| VMware vSphere Kubernetes Service and NVIDIA AI Enterprise | nvaie-tanzu_ | -| Google Cloud Anthos | :external+edgeanthos-guide | +| Red Hat OpenShift 4 using RHCOS worker nodes | [NVIDIA GPU Operator on Red Hat OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html) | +| VMware vSphere Kubernetes Service and NVIDIA AI Enterprise | [NVIDIA AI Enterprise VMware vSphere Deployment Guide](https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/index.html) | +| Google Cloud Anthos | [Google Cloud Anthos guide](https://docs.nvidia.com/datacenter/cloud-native/edge/latest/anthos-guide.html) | diff --git a/gpu-operator/.agents/skills/gpu-operator-references/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-references/SKILL.md index fddafd1a3..855374e42 100644 --- a/gpu-operator/.agents/skills/gpu-operator-references/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-references/SKILL.md @@ -1,6 +1,22 @@ --- name: "gpu-operator-references" -description: "Points users to the Confidential Containers reference architecture and deployment documentation. Use when users ask about confidential GPU workloads or Confidential Containers with the GPU Operator. Trigger keywords - NVIDIA GPU Operator, Confidential Containers, sandboxed workloads, Kubernetes, life cycle policy, support, releases, overview, GPU workloads, platform support, operating systems, release notes, component versions, changelog, security, deployment, troubleshooting, diagnostics." +description: "Loads NVIDIA GPU Operator reference material on demand: overview, security, life-cycle policy, platform support, release notes, troubleshooting, and confidential-containers deployment. 
Use when users ask conceptual or reference questions about the GPU Operator that are not tied to a specific install or upgrade flow." +triggers: + - GPU Operator overview + - platform support + - release notes + - life cycle policy + - security considerations + - troubleshooting + - Confidential Containers +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - reference + - overview + - troubleshooting --- diff --git a/gpu-operator/.agents/skills/gpu-operator-references/references/life-cycle-policy.md b/gpu-operator/.agents/skills/gpu-operator-references/references/life-cycle-policy.md index 386ca6346..9c753e44a 100644 --- a/gpu-operator/.agents/skills/gpu-operator-references/references/life-cycle-policy.md +++ b/gpu-operator/.agents/skills/gpu-operator-references/references/life-cycle-policy.md @@ -16,9 +16,8 @@ All prior major versions enter end of support and are no longer supported and do The product life cycle and versioning are subject to change in the future. -**Note:** - -Upgrades are only supported within a major release or to the next major release. +> [!NOTE] +> Upgrades are only supported within a major release or to the next major release. | GPU Operator Version | Status | | --- | --- | | 26.3.x | Supported | @@ -31,9 +30,8 @@ The following table shows the operands and default operand versions that corresp When post-release testing confirms support for newer versions of operands, these updates are identified as *recommended updates* to a GPU Operator version. Refer to Upgrading the NVIDIA GPU Operator for more information. -**Note:** - -All the following components are supported as government-ready in the NVIDIA GPU Operator v26.3, except for NVIDIA GDS Driver, NVIDIA Confidential Computing Manager, and NVIDIA GDRCopy Driver. +> [!NOTE] +> All the following components are supported as government-ready in the NVIDIA GPU Operator v26.3, except for NVIDIA GDS Driver, NVIDIA Confidential Computing Manager, and NVIDIA GDRCopy Driver. 
**D** = Default driver, **R** = Recommended driver | 1 Component | 1 GPU Operator Version | | @@ -55,10 +53,9 @@ All the following components are supported as government-ready in the NVIDIA GPU | NVIDIA Confidential Computing Manager for Kubernetes | [v0.3.0](https://github.com/NVIDIA/k8s-cc-manager/releases) | [v0.4.0](https://github.com/NVIDIA/k8s-cc-manager/releases) | | NVIDIA GDRCopy Driver | [v2.5.1](https://github.com/NVIDIA/gdrcopy/releases) | [v2.5.2](https://github.com/NVIDIA/gdrcopy/releases) | | NVIDIA Kata Sandbox Device Plugin | [v0.0.2](https://github.com/NVIDIA/sandbox-device-plugin/releases) | [v0.0.3](https://github.com/NVIDIA/sandbox-device-plugin/releases) | -**Note:** - -- Driver version could be different with NVIDIA vGPU, as it depends on the driver - version downloaded from the [NVIDIA Licensing Portal](https://ui.licensing.nvidia.com). -- The GPU Operator is supported on all active NVIDIA data center production drivers. - Refer to [Supported Drivers and CUDA Toolkit Versions](https://docs.nvidia.com/datacenter/tesla/drivers/index.html#supported-drivers-and-cuda-toolkit-versions) - for more information. +> [!NOTE] +> - Driver version could be different with NVIDIA vGPU, as it depends on the driver +> version downloaded from the [NVIDIA Licensing Portal](https://ui.licensing.nvidia.com). +> - The GPU Operator is supported on all active NVIDIA data center production drivers. +> Refer to [Supported Drivers and CUDA Toolkit Versions](https://docs.nvidia.com/datacenter/tesla/drivers/index.html#supported-drivers-and-cuda-toolkit-versions) +> for more information. 
diff --git a/gpu-operator/.agents/skills/gpu-operator-references/references/overview.md b/gpu-operator/.agents/skills/gpu-operator-references/references/overview.md index 0c05983a4..b7017eb19 100644 --- a/gpu-operator/.agents/skills/gpu-operator-references/references/overview.md +++ b/gpu-operator/.agents/skills/gpu-operator-references/references/overview.md @@ -2,7 +2,8 @@ # About the NVIDIA GPU Operator -![](graphics/nvidia-gpu-operator-image.jpg) +![NVIDIA GPU Operator architecture](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/_images/nvidia-gpu-operator-image.jpg) + Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, Infiniband adapters and other devices through the [device plugin framework](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/). However, configuring and managing nodes with these hardware resources requires @@ -16,20 +17,20 @@ automatic node labeling using [GFD](https://github.com/NVIDIA/gpu-feature-discov Browse through the following documents for getting started, platform support and release notes for the NVIDIA GPU Operator. -**Red Hat OpenShift Container Platform:** +> [!TIP] +> For Red Hat OpenShift Container Platform, refer to [NVIDIA GPU Operator on Red Hat OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html) for information about installing, managing, and upgrading the Operator. -Refer to :external+ocpindex for information about installing, managing, and upgrading the Operator on Red Hat OpenShift Container Platform. ### Getting Started -The operator-install-guide guide includes information on installing the GPU Operator in a Kubernetes cluster. +For installing the GPU Operator in a Kubernetes cluster, use the `gpu-operator-install` skill. ### Release Notes -Refer to operator-release-notes for information about releases. 
+For information about releases, see the release notes (use the `gpu-operator-references` skill and load `references/release-notes.md`). ### Platform Support -The operator-platform-support describes the supported platform configurations. +For the supported platform configurations, see platform support (use the `gpu-operator-references` skill and load `references/platform-support.md`). ## Licenses and Contributing @@ -40,25 +41,25 @@ more information on how to contribute and the release artifacts. The base images used by the software might include software that is licensed under open-source licenses such as GPL. The source code for these components is archived on the CUDA opensource [index](https://developer.download.nvidia.com/compute/cuda/opensource/). -The following table identifieis the licenses for the Operator and software components. +The following table identifies the licenses for the Operator and software components. By installing and using the GPU Operator, you accept the terms and conditions of these licenses. 
| Component | Artifact Type | Artifact Licenses | | --- | --- | --- | | NVIDIA GPU Operator | Helm Chart | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) | -| NVIDIA GPU Operator | Image | pstai_ | -| NVIDIA GPU Feature Discovery | Image | pstai_ | -| NVIDIA GPU Driver | Image | [License for Customer Use of NVIDIA Software](http://www.nvidia.com/content/DriverDownload-March2009/licence.php?lang=us) pstai_ | -| NVIDIA Container Toolkit | Image | pstai_ | -| NVIDIA Kubernetes Device Plugin | Image | pstai_ | -| NVIDIA MIG Manager for Kubernetes | Image | pstai_ | -| Validator for NVIDIA GPU Operator | Image | pstai_ | -| NVIDIA DCGM | Image | pstai_ | -| NVIDIA DCGM Exporter | Image | pstai_ | -| NVIDIA Driver Manager for Kubernetes | Image | pstai_ | -| NVIDIA KubeVirt GPU Device Plugin | Image | pstai_ | -| NVIDIA vGPU Device Manager | Image | pstai_ | -| NVIDIA GDS Driver | Image | [License for Customer Use of NVIDIA Software](http://www.nvidia.com/content/DriverDownload-March2009/licence.php?lang=us) pstai_ | -| NVIDIA Confidential Computing Manager for Kubernetes | Image | pstai_ | -| NVIDIA Kata Manager for Kubernetes | Image | pstai_ | -| NVIDIA GDRCopy Driver | Image | pstai_ | +| NVIDIA GPU Operator | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA GPU Feature Discovery | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA GPU Driver | Image | [License for Customer Use of NVIDIA Software](http://www.nvidia.com/content/DriverDownload-March2009/licence.php?lang=us)
[Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA Container Toolkit | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA Kubernetes Device Plugin | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA MIG Manager for Kubernetes | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| Validator for NVIDIA GPU Operator | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA DCGM | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA DCGM Exporter | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA Driver Manager for Kubernetes | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA KubeVirt GPU Device Plugin | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA vGPU Device Manager | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA GDS Driver | Image | [License for Customer Use of NVIDIA 
Software](http://www.nvidia.com/content/DriverDownload-March2009/licence.php?lang=us)
[Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA Confidential Computing Manager for Kubernetes | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA Kata Manager for Kubernetes | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | +| NVIDIA GDRCopy Driver | Image | [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/) | diff --git a/gpu-operator/.agents/skills/gpu-operator-references/references/platform-support.md b/gpu-operator/.agents/skills/gpu-operator-references/references/platform-support.md index 7b96f6b8c..09d491422 100644 --- a/gpu-operator/.agents/skills/gpu-operator-references/references/platform-support.md +++ b/gpu-operator/.agents/skills/gpu-operator-references/references/platform-support.md @@ -73,10 +73,9 @@ Refer to Common Chart Customization Options for more information. NVIDIA A2 NVIDIA Ampere +-------------------------+---------------------------+-------+ -**Note:** - -* The GPU Operator supports DGX A100 with DGX OS 5.1+ and Red Hat OpenShift using Red Hat Core OS. - For installation instructions, see preinstalled-drivers-and-toolkit for DGX OS 5.1+ and openshift-introduction for Red Hat OpenShift. +> [!NOTE] +> * The GPU Operator supports DGX A100 with DGX OS 5.1+ and Red Hat OpenShift using Red Hat Core OS. +> For installation instructions, see preinstalled-drivers-and-toolkit for DGX OS 5.1+ and openshift-introduction for Red Hat OpenShift. ### D,T and V-series Products +-----------------------+------------------------+-------+ @@ -128,13 +127,13 @@ Refer to Common Chart Customization Options for more information. 
NVIDIA T400 Turing +-------------------------+------------------------+-------+ -**Note:** - -NVIDIA RTX PRO 6000 Blackwell Server Edition notes: +> [!NOTE] +> NVIDIA RTX PRO 6000 Blackwell Server Edition notes: +> +> * Driver versions 575.57.08 or later is required. +> * MIG is not supported on the 575.57.08 driver release. +> * In cases where CUDA init fails, you may need to disable Heterogeneous Memory Management (HMM) in UVM by customizing the NVIDIA GPU driver parameters during installation (use the `gpu-operator-custom-driver` skill). -* Driver versions 575.57.08 or later is required. -* MIG is not supported on the 575.57.08 driver release. -* In cases where CUDA init fails, you may need to disable Heterogeneous Memory Management (HMM) in UVM by Customizing NVIDIA GPU Driver Parameters during Installation. ### B-series Products +-------------------------+------------------------+-------+ @@ -159,9 +158,8 @@ NVIDIA RTX PRO 6000 Blackwell Server Edition notes: NVIDIA DGX Station NVIDIA Blackwell +-------------------------+------------------------+-------+ -**Note:** - -* HGX B200 requires a driver container version of 570.133.20 or later. +> [!NOTE] +> * HGX B200 requires a driver container version of 570.133.20 or later. ## Supported ARM Based Platforms The following NVIDIA data center GPUs are supported: @@ -191,10 +189,9 @@ system that meets the following requirements is supported: - A supported operating system such as Ubuntu or Red Hat Enterprise Linux. -**Note:** - -The GPU Operator only supports platforms using discrete GPUs. -NVIDIA Jetson, or other embedded products with integrated GPUs, are not supported. +> [!NOTE] +> The GPU Operator only supports platforms using discrete GPUs. +> NVIDIA Jetson, or other embedded products with integrated GPUs, are not supported. NVIDIA IGX Orin, a platform with an integrated GPU, is supported as long as the discrete GPU is the device being used. 
## Supported Deployment Options @@ -211,9 +208,8 @@ The GPU Operator has been validated in the following scenarios: Virtual machines with NVIDIA vGPU based products +-----------------------------------------------------+ -**Note:** - -GPU Operator is supported with NVIDIA vGPU 12.0+. +> [!NOTE] +> GPU Operator is supported with NVIDIA vGPU 12.0+. ## Supported Operating Systems and Kubernetes Platforms The GPU Operator has been validated in the following scenarios: @@ -247,9 +243,9 @@ by the `unattended-upgrades` package to prevent an upgrade to an unsupported ker Non-precompiled driver containers for Red Hat Enterprise Linux 9.2, 9.4, 9.6, and 9.7 versions are available for x86 based platforms only. They are not available for ARM based systems. -**Note:** +> [!NOTE] +> Red Hat OpenShift Container Platform is supported on AWS, Azure, GCP, and OCI (Oracle) Virtual Machine or Bare Metal instances with T4, V100, L4, L40s, A10, A100, H100, and H200. -ocp_csp_support ### Cloud Service Providers | Operating System | Amazon EKS Kubernetes | Google GKE Kubernetes | @@ -295,10 +291,9 @@ The GPU Operator has been validated for the following container runtimes: Red Hat Enterprise Linux 9 Yes Yes +----------------------------+------------------------+----------------+ -**Note:** - -If you are planning to use the NRI Plugin, you must use containerd version v1.7.30+, v2.1.x and v2.2.x. -The NRI Plugin is not supported with CRI-O. +> [!NOTE] +> If you are planning to use the NRI Plugin, you must use containerd version v1.7.30+, v2.1.x and v2.2.x. +> The NRI Plugin is not supported with CRI-O. ## Support for KubeVirt and OpenShift Virtualization Red Hat OpenShift Virtualization is based on KubeVirt. @@ -342,9 +337,8 @@ KubeVirt and OpenShift Virtualization with NVIDIA vGPU is supported on the follo - NVIDIA HGX GB200 NVL72, GB300 NVL72 on Ubuntu 24.04 LTS. -**Note:** - -KubeVirt with NVIDIA vGPU is supported on `nodes` with Linux kernel < 6.0, such as Ubuntu 22.04 `LTS`. 
+> [!NOTE] +> KubeVirt with NVIDIA vGPU is supported on `nodes` with Linux kernel < 6.0, such as Ubuntu 22.04 `LTS`. ## Support for GPUDirect RDMA Supported operating systems and NVIDIA GPU Drivers with GPUDirect RDMA. @@ -370,14 +364,13 @@ Supported operating systems and NVIDIA GPU Drivers with GPUDirect Storage. - Ubuntu 20.04 and 22.04 LTS with Network Operator 25.7.0. - Red Hat OpenShift Container Platform 4.17 and higher. -**Note:** - -Version v2.17.5 and higher of the NVIDIA GPUDirect Storage kernel driver, `nvidia-fs`, -requires the NVIDIA Open GPU Kernel module driver. -You can install the open kernel modules by specifying the `driver.kernelModuleType=auto` if you are using driver container version 570.86.15, 570.124.06 or later. -Or use `driver.kernelModuleType=open` if you are using a different driver version or branch. -argument to the `helm` command. -Refer to Common Chart Customization Options for more information. +> [!NOTE] +> Version v2.17.5 and higher of the NVIDIA GPUDirect Storage kernel driver, `nvidia-fs`, +> requires the NVIDIA Open GPU Kernel module driver. +> You can install the open kernel modules by specifying the `driver.kernelModuleType=auto` if you are using driver container version 570.86.15, 570.124.06 or later. +> Or use `driver.kernelModuleType=open` if you are using a different driver version or branch. +> argument to the `helm` command. +> Refer to Common Chart Customization Options for more information. Not supported with secure boot. Supported storage types are local NVMe and remote NFS. @@ -392,7 +385,6 @@ Orchestration & resource scheduling: * [NVIDIA Run:ai](https://run-ai-docs.nvidia.com/) -**Note:** - -Run:ai requires the GPU Operator as a prerequisite and works with default GPU Operator settings. -Running the GPU Operator with Container Device Interface (CDI) enabled (default in v25.10.0 and later) requires Run:ai v2.24.38 and later, or v2.23.35 and later. 
Refer to the Run:ai [cluster requirements documentation](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/system-requirements#nvidia-gpu-operator) for more information. +> [!NOTE] +> Run:ai requires the GPU Operator as a prerequisite and works with default GPU Operator settings. +> Running the GPU Operator with Container Device Interface (CDI) enabled (default in v25.10.0 and later) requires Run:ai v2.24.38 and later, or v2.23.35 and later. Refer to the Run:ai [cluster requirements documentation](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/system-requirements#nvidia-gpu-operator) for more information. diff --git a/gpu-operator/.agents/skills/gpu-operator-references/references/release-notes.md b/gpu-operator/.agents/skills/gpu-operator-references/references/release-notes.md index 428057245..48bb5f13a 100644 --- a/gpu-operator/.agents/skills/gpu-operator-references/references/release-notes.md +++ b/gpu-operator/.agents/skills/gpu-operator-references/references/release-notes.md @@ -6,10 +6,8 @@ This document describes the new features, improvements, fixed issues, and known Refer to the GPU Operator Component Matrix for a list of software components and versions included in each release. -**Note:** - -GPU Operator beta releases are documented on [GitHub](https://github.com/NVIDIA/gpu-operator/releases). NVIDIA AI Enterprise builds are not posted on GitHub. ----- +> [!NOTE] +> GPU Operator beta releases are documented on [GitHub](https://github.com/NVIDIA/gpu-operator/releases). NVIDIA AI Enterprise builds are not posted on GitHub. ## 26.3.1 @@ -77,9 +75,9 @@ GPU Operator beta releases are documented on [GitHub](https://github.com/NVIDIA/ To learn more, refer to Container Device Interface (CDI) and Node Resource Interface (NRI) Plugin Support (use the `gpu-operator-container-device` skill). - **Note:** + > [!NOTE] + > Enabling the NRI plugin is not supported with cri-o. 
- Enabling the NRI plugin is not supported with cri-o. * Added support for dynamic MIG config generation. By default, the MIG Manager will automatically generate a per-node ConfigMap with the default MIG profiles for the available GPUs on the node. This replaces the previous static ConfigMap. @@ -92,11 +90,10 @@ GPU Operator beta releases are documented on [GitHub](https://github.com/NVIDIA/ Use this feature on new cluster installations to configure multiple driver types and versions on different nodes or multiple operating system versions on nodes. Refer to the NVIDIA Driver Custom Resource Definition documentation (use the `gpu-operator-nvidia-driver` skill) for more information. - **Note:** - - This feature does not support an upgrade from an earlier version of the NVIDIA GPU Operator or switching from ClusterPolicy to the NVIDIA Driver CRD. - It is recommended that you only use this feature from new installations. -* Added support for KubeVirt with GPU passthrough on Ubuntu 24.04 LTS + > [!NOTE] + > This feature does not support an upgrade from an earlier version of the NVIDIA GPU Operator or switching from ClusterPolicy to the NVIDIA Driver CRD. + > It is recommended that you only use this feature from new installations. + > * Added support for KubeVirt with GPU passthrough on Ubuntu 24.04 LTS * Added support for K3s. @@ -494,14 +491,13 @@ GPU Operator beta releases are documented on [GitHub](https://github.com/NVIDIA/ * Starting with version **580.65.06**, the driver container has **Coherent Driver Memory Management (CDMM)** enabled by default to support **GB200** on Kubernetes. For more information about CDMM, refer to the [release notes](https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-65-06/index.html#hardware-software-support). - **Note:** - - Currently, CDMM is not compatible with the **Multi-Instance GPUs (MIG)** sharing. - CDMM is also not compatible with **GPU Direct Storage**. 
- CDMM support for these features is planned for future driver updates. - However, these limitations will remain in place until a future driver update removes them. - CDMM enablement applies only to **Grace-based systems** such as **GH200** and **GB200** and is ignored on other GPU platforms. - NVIDIA strongly recommends keeping CDMM enabled with Kubernetes on supported systems to prevent memory over-reporting and uncontrolled GPU memory access. + > [!NOTE] + > Currently, CDMM is not compatible with the **Multi-Instance GPUs (MIG)** sharing. + > CDMM is also not compatible with **GPU Direct Storage**. + > CDMM support for these features is planned for future driver updates. + > However, these limitations will remain in place until a future driver update removes them. + > CDMM enablement applies only to **Grace-based systems** such as **GH200** and **GB200** and is ignored on other GPU platforms. + > NVIDIA strongly recommends keeping CDMM enabled with Kubernetes on supported systems to prevent memory over-reporting and uncontrolled GPU memory access. * For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01, GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs. @@ -2238,16 +2234,15 @@ These CVEs are from the base images and are not in libraries that are used by th This allows the GPU Operator to complement the [NVIDIA Network Operator](https://github.com/Mellanox/network-operator) to enable GPUDirect RDMA in the Kubernetes cluster. Refer to the RDMA documentation on getting started. - **Note:** - - This feature is available only when used with R470 drivers on Ubuntu 20.04 LTS. -* Added support for upgrades of the GPU Operator components. A new `k8s-driver-manager` component handles upgrades - of the NVIDIA drivers on nodes in the cluster. -* NVIDIA DCGM is now deployed as a component of the GPU Operator. 
The standalone DCGM container allows multiple clients such as - [DCGM-Exporter](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html) and [NVSM](https://docs.nvidia.com/nvidia-system-management-nvsm/) - to be deployed and connect to the existing DCGM container. -* Added a `nodeStatusExporter` component that exports operator and node metrics in a Prometheus format. The component provides - information on the status of the operator (e.g. reconciliation status, number of GPU enabled nodes). + > [!NOTE] + > This feature is available only when used with R470 drivers on Ubuntu 20.04 LTS. + > * Added support for upgrades of the GPU Operator components. A new `k8s-driver-manager` component handles upgrades + > of the NVIDIA drivers on nodes in the cluster. + > * NVIDIA DCGM is now deployed as a component of the GPU Operator. The standalone DCGM container allows multiple clients such as + > [DCGM-Exporter](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html) and [NVSM](https://docs.nvidia.com/nvidia-system-management-nvsm/) + > to be deployed and connect to the existing DCGM container. + > * Added a `nodeStatusExporter` component that exports operator and node metrics in a Prometheus format. The component provides + > information on the status of the operator (e.g. reconciliation status, number of GPU enabled nodes). ### Improvements * Reduced the size of the ClusterPolicy CRD by removing duplicates and redundant fields. @@ -2418,10 +2413,9 @@ DCGM-Exporter support includes the following: ### New Features * Added support for CentOS 7 and 8. 
- **Note:** - - Due to a known limitation with the GPU Operator's default values on CentOS, install the operator on CentOS 7/8 - using the following Helm command: + > [!NOTE] + > Due to a known limitation with the GPU Operator's default values on CentOS, install the operator on CentOS 7/8 + > using the following Helm command: ```console $ helm install --wait --generate-name \ diff --git a/gpu-operator/.agents/skills/gpu-operator-references/references/security.md b/gpu-operator/.agents/skills/gpu-operator-references/references/security.md index f621786fe..6c12ac66a 100644 --- a/gpu-operator/.agents/skills/gpu-operator-references/references/security.md +++ b/gpu-operator/.agents/skills/gpu-operator-references/references/security.md @@ -23,7 +23,7 @@ As a best practice, establish proper security policies and prevent any other use ## CVEs The following is a list of known CVEs in the GPU Operator or its operands. -To view any published security bulletins for NVIDIA products published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/. +To view any published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/. | CVE ID | Affected Components | Fixed Version | | --- | --- | --- | From 25272ee37bef3546529df0ee31e6b16db030a57a Mon Sep 17 00:00:00 2001 From: Andrew Chen Date: Mon, 1 Jun 2026 03:45:11 -0700 Subject: [PATCH 02/13] fix: global converter-defect pass across remaining 23 skills Applies the verified #401 systemic findings to all remaining skills: - Dropped Trigger keywords suffix; added triggers:/tags: frontmatter arrays sourced from each page's .. meta:: block (#13). - Removed Step N: H2/H3 prefixes (#14). - Converted flat-bold admonitions to GitHub alerts (#15). - Replaced ${version} leaks with v26.3.1. - Fixed 4 broken :external+ Sphinx roles to real OpenShift doc URLs (#7). - Restored 12 silently dropped .. 
literalinclude:: code blocks from the source manifests across nvidia-driver (4), timeslicing (3), gpudirect-rdma (2), google (2), multiinstance (1), amazon (1) (#2). - Fixed misspelled asset nvd-precomiled-some.yaml -> nvd-precompiled-some.yaml. Prerequisites/Verification sections and remaining cross-ref repairs land in the following per-skill commits. Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../gpu-operator-container-device/SKILL.md | 47 +++--- .../gpu-operator-custom-driver/SKILL.md | 20 ++- .../gpu-operator-driver-upgrades/SKILL.md | 49 +++--- .../gpu-operator-gpudirect-rdma/SKILL.md | 98 ++++++++++- .../SKILL.md | 87 +++++----- .../SKILL.md | 34 ++-- .../gpu-operator-install-http-proxy/SKILL.md | 44 +++-- .../SKILL.md | 33 ++-- .../gpu-operator-install-nvidia-vgpu/SKILL.md | 31 ++-- .../SKILL.md | 19 ++- .../SKILL.md | 18 ++- .../gpu-operator-kata-containers/SKILL.md | 107 ++++++------ .../skills/gpu-operator-kubevirt/SKILL.md | 56 ++++--- .../gpu-operator-multiinstance/SKILL.md | 83 ++++++---- .../gpu-operator-nvidia-amazon/SKILL.md | 57 +++++-- .../skills/gpu-operator-nvidia-azure/SKILL.md | 20 ++- .../skills/gpu-operator-nvidia-dra/SKILL.md | 57 ++++--- .../gpu-operator-nvidia-driver/SKILL.md | 153 ++++++++++++++++-- .../gpu-operator-nvidia-google/SKILL.md | 56 ++++++- .../gpu-operator-precompiled-drivers/SKILL.md | 33 ++-- .../gpu-operator-timeslicing-gpus/SKILL.md | 119 ++++++++++++-- .../gpu-operator-uninstalling-nvidia/SKILL.md | 21 ++- .../gpu-operator-upgrading-nvidia/SKILL.md | 33 ++-- 23 files changed, 929 insertions(+), 346 deletions(-) diff --git a/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md index 868b434c2..2eac638ab 100644 --- a/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md @@ -1,6 +1,20 @@ --- name: 
"gpu-operator-container-device" -description: "Explains how to configure CDI and NRI support for GPU workloads. Use when enabling CDI, configuring containerd, or troubleshooting CDI-based GPU injection. Trigger keywords - NVIDIA GPU Operator, CDI, NRI, containerd, Kubernetes." +description: "Explains how to configure CDI and NRI support for GPU workloads. Use when enabling CDI, configuring containerd, or troubleshooting CDI-based GPU injection." +triggers: + - NVIDIA GPU Operator + - CDI + - NRI + - containerd + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - cdi + - nri + - containerd --- @@ -37,12 +51,11 @@ Examples of GPU Management Containers include monitoring agents and device plugi It is recommended that `NVIDIA_VISIBLE_DEVICES` only be used by GPU Management Containers. -**Note:** +> [!NOTE] +> Setting `runtimeClassName: nvidia` in the pod specification is not required when the NRI Plugin is enabled in GPU Operator. +> Refer to About the Node Resource Interface (NRI) Plugin. -Setting `runtimeClassName: nvidia` in the pod specification is not required when the NRI Plugin is enabled in GPU Operator. -Refer to About the Node Resource Interface (NRI) Plugin. - -## Step 1: Enabling CDI +## Enabling CDI CDI is enabled by default during installation in GPU Operator v25.10.0 and later. Follow the instructions for installing the Operator with Helm on the getting-started page. 
@@ -76,7 +89,7 @@ Use the following procedure to enable CDI if you disabled CDI during installatio *Example Output* -## Step 2: Disabling CDI +## Disabling CDI While CDI is the default and recommended mechanism for injecting GPU support into containers, you can disable CDI and use the legacy NVIDIA Container Toolkit stack instead with the following procedure: @@ -91,11 +104,10 @@ disable CDI and use the legacy NVIDIA Container Toolkit stack instead with the f --overwrite ``` - **Tip:** - - You can run `kubectl get nodes -o wide` and view the `CONTAINER-RUNTIME` - column to determine if your nodes use CRI-O. -1. Disable CDI by modifying the cluster policy: + > [!TIP] + > You can run `kubectl get nodes -o wide` and view the `CONTAINER-RUNTIME` + > column to determine if your nodes use CRI-O. + > 1. Disable CDI by modifying the cluster policy: ```console $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ @@ -130,7 +142,7 @@ In previous GPU Operator versions, device injection was handled by the `nvidia` Additionally, with the NRI Plugin enabled, modifications to the container runtime configuration are no longer needed. For example, no modifications are made to containerd’s config.toml file. This means that on platforms that configure containerd in a non-standard way, like k3s, k0s, and Rancher Kubernetes Engine 2, users no longer need to configure environment variables like `CONTAINERD_CONFIG`, `CONTAINERD_SOCKET`, or `RUNTIME_CONFIG_SOURCE`. -## Step 3: Enabling the NRI Plugin +## Enabling the NRI Plugin The NRI Plugin requires the following: @@ -139,10 +151,9 @@ The NRI Plugin requires the following: - containerd v1.7.30, v2.1.x, or v2.2.x. If you are not using the latest containerd version, check that both CDI and NRI are enabled in the containerd configuration file before deploying GPU Operator. - **Note:** - - Enabling the NRI plugin is not supported with cri-o. 
-To enable the NRI Plugin during installation, follow the instructions for installing the Operator with Helm on the getting-started page and include the `--set cdi.nriPluginEnabled=true` argument in your Helm command. + > [!NOTE] + > Enabling the NRI plugin is not supported with cri-o. + > To enable the NRI Plugin during installation, follow the instructions for installing the Operator with Helm on the getting-started page and include the `--set cdi.nriPluginEnabled=true` argument in your Helm command. ### Enabling the NRI Plugin After Installation @@ -169,7 +180,7 @@ To enable the NRI Plugin during installation, follow the instructions for instal *Example Output* -## Step 4: Disabling the NRI Plugin +## Disabling the NRI Plugin Disable the NRI Plugin and use the `nvidia` runtime class instead with the following procedure: diff --git a/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md index f51403551..59d263d88 100644 --- a/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-custom-driver" -description: "Shows how to provide custom NVIDIA driver parameters to GPU Operator driver containers. Use when changing driver module options or customizing driver container behavior. Trigger keywords - NVIDIA GPU Operator, driver parameters, NVIDIA driver, configuration." +description: "Shows how to provide custom NVIDIA driver parameters to GPU Operator driver containers. Use when changing driver module options or customizing driver container behavior." 
+triggers: + - NVIDIA GPU Operator + - driver parameters + - NVIDIA driver + - configuration +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - driver + - configuration --- @@ -14,7 +26,7 @@ On a machine with the driver already installed, you can list the parameter names You can pass custom parameters to the kernel modules that get loaded as part of the NVIDIA Driver installation (`nvidia`, `nvidia-modeset`, `nvidia-uvm`, and `nvidia-peermem`). -## Step 1: Configure Custom Driver Parameters +## Configure Custom Driver Parameters To pass custom parameters, execute the following steps. @@ -42,7 +54,7 @@ To pass custom parameters, execute the following steps. $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set driver.kernelModuleConfig.name="kernel-module-params" ``` @@ -72,7 +84,7 @@ Refer to [Simplifying GPU Application Development with Heterogeneous Memory Mana $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set driver.kernelModuleConfig.name="kernel-module-params" ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/SKILL.md index 1bc83791b..49e341c54 100644 --- a/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-driver-upgrades" -description: "Explains GPU driver upgrade behavior and configuration. Use when planning driver upgrades or troubleshooting driver upgrade workflows managed by the GPU Operator. Trigger keywords - NVIDIA GPU Operator, GPU driver, driver upgrades, Kubernetes." +description: "Explains GPU driver upgrade behavior and configuration. 
Use when planning driver upgrades or troubleshooting driver upgrade workflows managed by the GPU Operator." +triggers: + - NVIDIA GPU Operator + - GPU driver + - driver upgrades + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - driver + - upgrades --- @@ -21,12 +33,11 @@ Consequently, the following steps must occur across a driver upgrade: The GPU Operator supports several methods for managing and automating this driver upgrade process. -**Note:** +> [!NOTE] +> The GPU Operator only manages the lifecycle of containerized drivers. +> Drivers which are pre-installed on the host are not managed by the GPU Operator. -The GPU Operator only manages the lifecycle of containerized drivers. -Drivers which are pre-installed on the host are not managed by the GPU Operator. - -## Step 1: Upgrades with the Upgrade Controller +## Upgrades with the Upgrade Controller NVIDIA recommends upgrading by using the upgrade controller and the controller is enabled by default in the GPU Operator. The controller automates the upgrade process and generates metrics and events so that you can monitor the upgrade process. @@ -133,12 +144,11 @@ driver: deleteEmptyDir: false ``` -**Warning:** - -`driver.upgradePolicy.drain.enable` is a cluster-wide policy setting. -When set to `true`, the upgrade controller drains each node before upgrading the driver on that node. -Draining a node evicts all pods from that node, including workloads unrelated to the GPU driver. -This is a disruptive operation that interrupts running GPU and non-GPU workloads on every node the upgrade controller processes. +> [!WARNING] +> `driver.upgradePolicy.drain.enable` is a cluster-wide policy setting. +> When set to `true`, the upgrade controller drains each node before upgrading the driver on that node. +> Draining a node evicts all pods from that node, including workloads unrelated to the GPU driver. 
+> This is a disruptive operation that interrupts running GPU and non-GPU workloads on every node the upgrade controller processes. Enable `drain` only when `gpuPodDeletion` is insufficient to remove all GPU-using pods on its own. Adjust the `gpuPodDeletion` settings first and use `drain` only if those settings do not work. @@ -253,7 +263,7 @@ If the upgrade fails for a particular node, the node is labelled with the `upgra $ kubectl label node nvidia.com/gpu-driver-upgrade-state=upgrade-required --overwrite -## Step 2: Upgrades without the Upgrade Controller +## Upgrades without the Upgrade Controller If the upgrade controller is disabled or not supported for your GPU Operator version, a component called `k8s-driver-manager` is responsible for executing the driver upgrade process. @@ -304,10 +314,9 @@ driver: * The `DRAIN_USE_FORCE` environment variable must be enabled to evict GPU pods that are not managed by any of the replication controllers such as deployment, daemon set, stateful set, and replica set. * The `DRAIN_DELETE_EMPTYDIR_DATA` environment variable must be enabled to delete GPU pods that use the `emptyDir` type volume. -**Note:** - -Since GPU pods get evicted whenever the NVIDIA Driver daemon set specification is updated, it might not always be desirable to allow this to happen automatically. -To prevent this `daemonsets.updateStrategy` parameter in the `ClusterPolicy` can be set to [OnDelete](https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#daemonset-update-strategy) . -With `OnDelete` update strategy, a new driver pod with the updated spec will only get deployed on a node once the old driver pod is manually deleted. -Thus, admins can control when to rollout spec updates to driver pods on any given node. -For more information on DaemonSet update strategies, refer to the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#daemonset-update-strategy). 
+> [!NOTE] +> Since GPU pods get evicted whenever the NVIDIA Driver daemon set specification is updated, it might not always be desirable to allow this to happen automatically. +> To prevent this `daemonsets.updateStrategy` parameter in the `ClusterPolicy` can be set to [OnDelete](https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#daemonset-update-strategy) . +> With `OnDelete` update strategy, a new driver pod with the updated spec will only get deployed on a node once the old driver pod is manually deleted. +> Thus, admins can control when to rollout spec updates to driver pods on any given node. +> For more information on DaemonSet update strategies, refer to the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#daemonset-update-strategy). diff --git a/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md index 578edc132..e7818b615 100644 --- a/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md @@ -1,6 +1,20 @@ --- name: "gpu-operator-gpudirect-rdma" -description: "Guides users through GPUDirect RDMA and GPUDirect Storage configuration. Use when enabling high-performance networking or storage access for GPU workloads. Trigger keywords - NVIDIA GPU Operator, GPUDirect RDMA, GPUDirect Storage, networking." +description: "Guides users through GPUDirect RDMA and GPUDirect Storage configuration. Use when enabling high-performance networking or storage access for GPU workloads." +triggers: + - NVIDIA GPU Operator + - GPUDirect RDMA + - GPUDirect Storage + - networking +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - gpudirect + - rdma + - storage + - networking --- @@ -34,7 +48,7 @@ set up the networking related components such as network device kernel drivers a workloads to take advantage of GPUDirect RDMA and GPUDirect Storage. 
Refer to the Network Operator [documentation](https://docs.nvidia.com/networking/software/cloud-orchestration/index.html) for installation information. -## Step 1: Common Prerequisites +## Common Prerequisites The prerequisites for configuring GPUDirect RDMA or GPUDirect Storage depend on whether you use DMA-BUF from the Linux kernel or the legacy `nvidia-peermem` kernel module. @@ -68,7 +82,7 @@ The prerequisites for configuring GPUDirect RDMA or GPUDirect Storage depend on [Deploy an AI-Ready Enterprise Platform on vSphere 7](https://www.vmware.com/docs/deploy-an-ai-ready-enterprise-platform-on-vsphere-7-update-2#vm-settings-A) document from VMWare. -## Step 2: Configuring GPUDirect RDMA +## Configuring GPUDirect RDMA ### Platform Support @@ -89,7 +103,7 @@ To use DMA-BUF and network device drivers that are installed by the Network Oper $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ ``` To use DMA-BUF and network device drivers that are installed on the host: @@ -98,7 +112,7 @@ To use DMA-BUF and network device drivers that are installed on the host: $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set driver.rdma.useHostMofed=true ``` @@ -232,6 +246,40 @@ correctly and that pods can perform RDMA data transfers. - Create a file, such as `demo-pod-1.yaml`, for the first pod with contents like the following: + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: demo-pod-1 + annotations: + k8s.v1.cni.cncf.io/networks: demo-macvlannetwork + # If a network with static IPAM is used replace network annotation with the below. 
+ # k8s.v1.cni.cncf.io/networks: '[ + # { "name": "rdma-net", + # "ips": ["192.168.111.101/24"], + # "gateway": ["192.168.111.1"] + # } + # ]' + spec: + nodeSelector: + # Note: Replace hostname or remove selector altogether + kubernetes.io/hostname: nvnode1 + restartPolicy: OnFailure + containers: + - image: mellanox/cuda-perftest + name: rdma-gpu-test-ctr + securityContext: + capabilities: + add: [ "IPC_LOCK" ] + resources: + limits: + nvidia.com/gpu: 1 + rdma/rdma_shared_device_a: 1 + requests: + nvidia.com/gpu: 1 + rdma/rdma_shared_device_a: 1 + ``` + - Apply the manifest: ```console @@ -242,6 +290,40 @@ correctly and that pods can perform RDMA data transfers. - Create a file, such as `demo-pod-2.yaml`, for the second pod with contents like the following: + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: demo-pod-2 + annotations: + k8s.v1.cni.cncf.io/networks: demo-macvlannetwork + # If a network with static IPAM is used replace network annotation with the below. + # k8s.v1.cni.cncf.io/networks: '[ + # { "name": "rdma-net", + # "ips": ["192.168.111.101/24"], + # "gateway": ["192.168.111.1"] + # } + # ]' + spec: + nodeSelector: + # Note: Replace hostname or remove selector altogether + kubernetes.io/hostname: nvnode2 + restartPolicy: OnFailure + containers: + - image: mellanox/cuda-perftest + name: rdma-gpu-test-ctr + securityContext: + capabilities: + add: [ "IPC_LOCK" ] + resources: + limits: + nvidia.com/gpu: 1 + rdma/rdma_shared_device_a: 1 + requests: + nvidia.com/gpu: 1 + rdma/rdma_shared_device_a: 1 + ``` + - Apply the manifest: ```console @@ -349,7 +431,7 @@ correctly and that pods can perform RDMA data transfers. 
$ kubectl delete -f demo-macvlannetworks.yaml ``` -## Step 3: Using GPUDirect Storage +## Using GPUDirect Storage ### Platform Support @@ -370,7 +452,7 @@ The following sample command applies to clusters that use the Network Operator t $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set gds.enabled=true ``` @@ -452,7 +534,7 @@ ib_core 319488 9 rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverb drm 491520 6 drm_kms_helper,drm_vram_helper,nvidia,mgag200,ttm ``` -## Step 4: Related Information +## Related Information Refer to the following resources for more information: diff --git a/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md index a7da6e3c6..d2ac4ec34 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md @@ -1,6 +1,19 @@ --- name: "gpu-operator-install-airgapped-environments" -description: "Guides users through installing the GPU Operator in air-gapped or restricted network environments. Use when users need mirrored images, private registries, or offline installation steps. Trigger keywords - NVIDIA GPU Operator, air-gapped, restricted network, installation." +description: "Guides users through installing the GPU Operator in air-gapped or restricted network environments. Use when users need mirrored images, private registries, or offline installation steps." 
+triggers: + - NVIDIA GPU Operator + - air-gapped + - restricted network + - installation +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - air-gapped + - private-registry + - installation --- @@ -16,14 +29,13 @@ By default, The GPU Operator requires internet access for the following reasons: 1) Container images need to be pulled during GPU Operator installation. 2) The `driver` container needs to download several OS packages prior to driver installation. - **Tip:** - - Using precompiled-drivers removes the need for the `driver` containers to - download operating system packages and removes the need to create a local package repository. -To address these requirements, it may be necessary to create a local image registry and/or a local package repository -so that the necessary images and packages are available for your cluster. In subsequent sections, we detail how to -configure the GPU Operator to use local image registries and local package repositories. If your cluster is behind -a proxy, also follow the steps from install-gpu-operator-proxy. + > [!TIP] + > Using precompiled-drivers removes the need for the `driver` containers to + > download operating system packages and removes the need to create a local package repository. + > To address these requirements, it may be necessary to create a local image registry and/or a local package repository + > so that the necessary images and packages are available for your cluster. In subsequent sections, we detail how to + > configure the GPU Operator to use local image registries and local package repositories. If your cluster is behind + > a proxy, also follow the steps from install-gpu-operator-proxy. Different steps are required for different environments with varying levels of internet connectivity. 
The supported use cases/environments are listed in the below table: @@ -53,26 +65,23 @@ The supported use cases/environments are listed in the below table: Repository | +--------+-----------------+--------------------+--------------------+ -**Note:** - -For Red Hat Openshift deployments in air-gapped environments (use cases 2, 3a and 3b), -refer to :external+ocpmirror-gpu-ocp-disconnected. -**Note:** - -Ensure that Kubernetes nodes can successfully reach the local DNS server(s). -Public name resolution for image registry and package repositories are -mandatory for use cases 1 and 2. -Before proceeding to the next sections, get the `values.yaml` file used for GPU Operator configuration. +> [!NOTE] +> For Red Hat Openshift deployments in air-gapped environments (use cases 2, 3a and 3b), +> refer to [Mirror GPU Operator images for disconnected OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/mirror-gpu-ocp-disconnected.html). +> [!NOTE] +> Ensure that Kubernetes nodes can successfully reach the local DNS server(s). +> Public name resolution for image registry and package repositories are +> mandatory for use cases 1 and 2. +> Before proceeding to the next sections, get the `values.yaml` file used for GPU Operator configuration. ```console $ curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator/v1.7.0/deployments/gpu-operator/values.yaml ``` -**Note:** +> [!NOTE] +> Replace `v1.7.0` in the above command with the version you want to use. -Replace `v1.7.0` in the above command with the version you want to use. - -## Step 1: Local Image Registry +## Local Image Registry Without internet access, the GPU Operator requires all images to be hosted in a local image registry that is accessible to all nodes in the cluster. 
To allow the GPU Operator to work with a local registry, users can specify local @@ -92,13 +101,13 @@ An example is shown below with the Operator container image: operator: repository: nvcr.io/nvidia image: gpu-operator - version: "${version}" + version: "v26.3.1" ``` -For instance, to pull the gpu-operator image version ${version}, use the following instruction: +For instance, to pull the gpu-operator image version v26.3.1, use the following instruction: ```console -$ docker pull nvcr.io/nvidia/gpu-operator:${version} +$ docker pull nvcr.io/nvidia/gpu-operator:v26.3.1 ``` There is one caveat with regards to the driver image. The version field must be appended by the OS name running on the worker node. @@ -121,23 +130,22 @@ To push the images to the local registry, simply tag the pulled images by prefix Using the above examples, this will result in: ```console -$ docker tag nvcr.io/nvidia/gpu-operator:${version} //gpu-operator:${version} +$ docker tag nvcr.io/nvidia/gpu-operator:v26.3.1 //gpu-operator:v26.3.1 $ docker tag nvcr.io/nvidia/driver:${recommended}-ubuntu20.04 //driver:${recommended}-ubuntu20.04 ``` Finally, push the images to the local registry: ```console -$ docker push //gpu-operator:${version} +$ docker push //gpu-operator:v26.3.1 $ docker push //driver:${recommended}-ubuntu20.04 ``` Update `values.yaml` with local registry information in the repository field. -**Note:** - -Replace below with your local image registry URL and port. -Sample of `values.yaml` for GPU Operator v1.9.0: +> [!NOTE] +> Replace below with your local image registry URL and port. +> Sample of `values.yaml` for GPU Operator v1.9.0: ```yaml operator: @@ -211,16 +219,15 @@ operator: imagePullSecrets: [] ``` -## Step 2: Local Package Repository +## Local Package Repository The `driver` container deployed as part of the GPU Operator requires certain packages to be available as part of the driver installation. 
In restricted internet access or air-gapped installations, users are required to create a local mirror repository for their OS distribution and make the following packages available: -**Note:** - -KERNEL_VERSION is the underlying running kernel version on the GPU node -GCC_VERSION is the gcc version matching the one used for building underlying kernel +> [!NOTE] +> KERNEL_VERSION is the underlying running kernel version on the GPU node +> GCC_VERSION is the gcc version matching the one used for building underlying kernel Configuring a local package repository is not necessary for clusters that can run precompiled-drivers. @@ -333,14 +340,14 @@ driver: name: cert-config ``` -## Step 3: Deploy GPU Operator +## Deploy GPU Operator Download and deploy GPU Operator Helm Chart with the updated `values.yaml`. Fetch the chart from the NGC repository: ```console -$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-${version}.tgz +$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-v26.3.1.tgz ``` Install the GPU Operator with the customized `values.yaml`: @@ -348,7 +355,7 @@ Install the GPU Operator with the customized `values.yaml`: ```console $ helm install --wait gpu-operator \ -n gpu-operator --create-namespace \ - gpu-operator-${version}.tgz \ + gpu-operator-v26.3.1.tgz \ -f values.yaml ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md index d10011099..6ae82c54f 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-install-governmentready-environments" -description: "Guides users through government-ready GPU Operator installation considerations. Use when deploying in hardened or regulated Kubernetes environments. 
Trigger keywords - NVIDIA GPU Operator, government-ready, installation, Kubernetes." +description: "Guides users through government-ready GPU Operator installation considerations. Use when deploying in hardened or regulated Kubernetes environments." +triggers: + - NVIDIA GPU Operator + - government-ready + - installation + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - government-ready + - installation --- @@ -12,18 +24,17 @@ The NVIDIA GPU Operator now offers government-ready components for NVIDIA AI Ent Government ready is NVIDIA's designation for software that meets applicable security requirements for deployment in your FedRAMP High or equivalent sovereign use case. For more information on NVIDIA's government-ready support, refer to the white paper [AI Software for Regulated Environments](https://docs.nvidia.com/ai-enterprise/planning-resource/ai-software-regulated-environments-white-paper/latest/index.html). -## Step 1: Supported GPU Operator Components +## Supported GPU Operator Components Refer to the operator-component-matrix for a full list of supported government-ready GPU Operator components. Artifacts for these components are available from the [NVIDIA NGC Catalog](https://registry.ngc.nvidia.com/orgs/nvstaging/teams/cloud-native/containers/gpu-driver-stig-fips). -**Note:** +> [!NOTE] +> Not all GPU Operator components and features are available as government-ready containers in this release. +> For example, NVIDIA GDS Driver, NVIDIA Confidential Computing Manager, and NVIDIA GDRCopy Driver are not yet supported. -Not all GPU Operator components and features are available as government-ready containers in this release. -For example, NVIDIA GDS Driver, NVIDIA Confidential Computing Manager, and NVIDIA GDRCopy Driver are not yet supported. 
- -## Step 2: Validated Kubernetes Distributions +## Validated Kubernetes Distributions The government-ready NVIDIA GPU Operator has been validated on the following Kubernetes distributions: @@ -32,7 +43,7 @@ The government-ready NVIDIA GPU Operator has been validated on the following Kub - Rancher Kubernetes Engine 2 with Ubuntu 24.04 - VMware VKS with Ubuntu 24.04 -## Step 3: Install Government-Ready NVIDIA GPU Operator +## Install Government-Ready NVIDIA GPU Operator Once you have your gov-ready-prerequisites configured, use the following steps to install the NVIDIA GPU Operator on Canonical Kubernetes distributions: @@ -41,9 +52,8 @@ Once you have your gov-ready-prerequisites configured, use the following steps t 1. create-ubuntu-pro-token-secret 1. deploy-nvidia-gpu-operator-gov-ready -**Note:** - -For deployment on OpenShift, refer to the :external+ocpinstall-gpu-operator-gov-ready-openshift page. +> [!NOTE] +> For deployment on OpenShift, refer to [Install GPU Operator (government-ready) on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-operator-gov-ready-openshift.html). ### Prerequisites - An active NVIDIA AI Enterprise subscription and NGC API token to access GPU Operator government-ready containers. @@ -148,7 +158,7 @@ The Ubuntu Pro Token is required for the driver container to download kernel hea Refer to [Common Chart Customization Options](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options) for more information about installation options. -## Step 4: Update Ubuntu Pro Token in ClusterPolicy +## Update Ubuntu Pro Token in ClusterPolicy You can update your Ubuntu Pro Token after installation by editing your Ubuntu Pro Token secret. This secret name is set as value of `driver.secretEnv` of the GPU Operator ClusterPolicy. 
diff --git a/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md index 0dd005cd7..59ead3f96 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-install-http-proxy" -description: "Guides users through installing the GPU Operator with HTTP proxy settings. Use when clusters require proxy configuration for image pulls or network access. Trigger keywords - NVIDIA GPU Operator, HTTP proxy, installation, Kubernetes." +description: "Guides users through installing the GPU Operator with HTTP proxy settings. Use when clusters require proxy configuration for image pulls or network access." +triggers: + - NVIDIA GPU Operator + - HTTP proxy + - installation + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - proxy + - installation --- @@ -20,12 +32,11 @@ By default, the GPU Operator requires internet access for the following reasons: 1) Container images need to be pulled during GPU Operator installation. 2) The `driver` container needs to download several OS packages prior to driver installation. - **Tip:** - - Using precompiled-drivers removes the need for the `driver` containers to - download operating system packages. -To address these requirements, all Kubernetes nodes as well as the `driver` container need proper configuration -in order to direct traffic through the proxy. + > [!TIP] + > Using precompiled-drivers removes the need for the `driver` containers to + > download operating system packages. + > To address these requirements, all Kubernetes nodes as well as the `driver` container need proper configuration + > in order to direct traffic through the proxy. This document demonstrates how to configure the GPU Operator so that the `driver` container can successfully download packages behind a HTTP proxy. 
Since configuring Kubernetes/container runtime components to use @@ -33,19 +44,19 @@ a proxy is not specific to the GPU Operator, we do not include those instruction The instructions for Openshift are different, so skip the section titled proxy_config_openshift if you are not running Openshift. -## Step 1: HTTP Proxy Configuration for Openshift +## HTTP Proxy Configuration for Openshift For Openshift, it is recommended to use the cluster-wide Proxy object to provide proxy information for the cluster. Follow the procedure described in [Configuring the cluster-wide proxy](https://docs.openshift.com/container-platform/latest/networking/enable-cluster-wide-proxy.html) from Red Hat Openshift public documentation. The GPU Operator will automatically inject proxy related ENV into the `driver` container based on information present in the cluster-wide Proxy object. -## Step 2: HTTP Proxy Configuration +## HTTP Proxy Configuration First, get the `values.yaml` file used for GPU Operator configuration: ```console -$ curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator/${version}/deployments/gpu-operator/values.yaml +$ curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator/v26.3.1/deployments/gpu-operator/values.yaml ``` Specify `driver.env` in `values.yaml` with appropriate HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables @@ -68,19 +79,18 @@ driver: value: ``` -**Note:** - -* Proxy related ENV are automatically injected by GPU Operator into the `driver` container to indicate proxy information used when downloading necessary packages. -* If HTTPS Proxy server is setup then change the values of HTTPS_PROXY and https_proxy to use `https` instead. +> [!NOTE] +> * Proxy related ENV are automatically injected by GPU Operator into the `driver` container to indicate proxy information used when downloading necessary packages. +> * If HTTPS Proxy server is setup then change the values of HTTPS_PROXY and https_proxy to use `https` instead. 
-## Step 3: Deploy GPU Operator +## Deploy GPU Operator Download and deploy GPU Operator Helm Chart with the updated `values.yaml`. Fetch the chart from the NGC repository: ```console -$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-${version}.tgz +$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-v26.3.1.tgz ``` Install the GPU Operator with updated `values.yaml`: @@ -88,7 +98,7 @@ Install the GPU Operator with updated `values.yaml`: ```console $ helm install --wait gpu-operator \ -n gpu-operator --create-namespace \ - gpu-operator-${version}.tgz \ + gpu-operator-v26.3.1.tgz \ -f values.yaml ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/SKILL.md index d06e723cb..fa033b90e 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-install-nvidia-enterprise" -description: "Guides users through installing the GPU Operator with NVIDIA AI Enterprise. Use when deploying licensed NVIDIA AI Enterprise GPU software on Kubernetes. Trigger keywords - NVIDIA GPU Operator, NVIDIA AI Enterprise, installation, Kubernetes." +description: "Guides users through installing the GPU Operator with NVIDIA AI Enterprise. Use when deploying licensed NVIDIA AI Enterprise GPU software on Kubernetes." +triggers: + - NVIDIA GPU Operator + - NVIDIA AI Enterprise + - installation + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - nvidia-ai-enterprise + - installation --- @@ -21,9 +33,9 @@ For information about supported platforms, hypervisors, and operating systems, r [Product Support Matrix](https://docs.nvidia.com/ai-enterprise/latest/product-support-matrix/index.html) in the NVIDIA AI Enterprise documentation. 
-For information about using vGPU with Red Hat OpenShift, refer to :external+ocpnvaie-with-ocp. +For information about using vGPU with Red Hat OpenShift, refer to [NVIDIA AI Enterprise with OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/nvaie-with-ocp.html). -## Step 1: Installing GPU Operator Using the vGPU Driver +## Installing GPU Operator Using the vGPU Driver ### Prerequisites @@ -61,7 +73,7 @@ For information about using vGPU with Red Hat OpenShift, refer to :external+ocpn $ bash gpu-operator-nvaie.sh install ``` -## Step 2: Updating NLS Client License Token +## Updating NLS Client License Token In case the NLS client license token needs to be updated, use the following procedure: @@ -75,11 +87,10 @@ Generate and download a new NLS client license token. Refer to Section 4.6 of th Rename the NLS client license token that you downloaded to `client_configuration_token.tok`. -**Warning:** - -The `configMap(configMapName)` is **deprecated** and will be removed in a future release. -Use `secrets(secretName)` instead. -Create a new `licensing-config-new` Secret object in the `gpu-operator` namespace (make sure the name of the secret is not already used in the kubernetes cluster). Both the vGPU license configuration file and the NLS client license token will be added to this Secret: +> [!WARNING] +> The `configMap(configMapName)` is **deprecated** and will be removed in a future release. +> Use `secrets(secretName)` instead. +> Create a new `licensing-config-new` Secret object in the `gpu-operator` namespace (make sure the name of the secret is not already used in the kubernetes cluster). 
Both the vGPU license configuration file and the NLS client license token will be added to this Secret: ```console $ kubectl create secret generic licensing-config-new \ @@ -110,7 +121,7 @@ Write and exit from the kubectl edit session (you can use :qw for instance if vi GPU Operator sequentially redeploys all the driver pods with this new licensing information. -## Step 3: Installing GPU Operator Using the Data Center Driver +## Installing GPU Operator Using the Data Center Driver This installation method is available for bare metal clusters or any cluster that does not use virtualization. @@ -128,6 +139,6 @@ To identify the correct driver branch: After identifying the correct driver version, refer to install-gpu-operator for installation instructions. Use the `--version=` argument when installing with Helm. -## Step 4: Related Information +## Related Information - [NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/) web page. diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md index 111e8df78..98ce2a8bc 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-install-nvidia-vgpu" -description: "Guides users through installing the GPU Operator with NVIDIA vGPU. Use when deploying virtual GPU software or configuring vGPU licensing with Kubernetes. Trigger keywords - NVIDIA GPU Operator, NVIDIA vGPU, installation, Kubernetes." +description: "Guides users through installing the GPU Operator with NVIDIA vGPU. Use when deploying virtual GPU software or configuring vGPU licensing with Kubernetes." 
+triggers: + - NVIDIA GPU Operator + - NVIDIA vGPU + - installation + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - vgpu + - installation --- @@ -24,13 +36,13 @@ Also replace `kubectl` in the following commands with `oc` when running on Red H NVIDIA vGPU is only supported with the NVIDIA License System. -## Step 1: Platform Support +## Platform Support For information about the supported platforms, refer to Supported Deployment Options, Hypervisors, and NVIDIA vGPU Based Products. For Red Hat OpenShift Virtualization, refer to NVIDIA GPU Operator with OpenShift Virtualization. -## Step 2: Download vGPU Software +## Download vGPU Software Perform the following steps to download the vGPU software and the latest NVIDIA vGPU driver catalog file from the NVIDIA Licensing Portal. @@ -44,7 +56,7 @@ The vGPU software is packaged as a ZIP file. Unzip the file to obtain the NVIDIA vGPU Linux guest driver. The guest driver file name follows the pattern `NVIDIA-Linux-x86_64--grid.run`. -## Step 3: Build the Driver Container +## Build the Driver Container Perform the following steps to build and push a container image that includes the vGPU Linux guest driver. @@ -102,10 +114,9 @@ Perform the following steps to build and push a container image that includes th 1. Build the driver container image. - **Note:** - - Docker is the only supported container tool for building the driver container image. - Multi-architecture builds additionally require [buildx](https://github.com/docker/buildx). + > [!NOTE] + > Docker is the only supported container tool for building the driver container image. + > Multi-architecture builds additionally require [buildx](https://github.com/docker/buildx). 
```console $ VGPU_GUEST_DRIVER_VERSION=${VGPU_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/driver make build-vgpuguest-${OS_TAG} @@ -127,7 +138,7 @@ Perform the following steps to build and push a container image that includes th $ VGPU_GUEST_DRIVER_VERSION=${VGPU_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/driver make push-vgpuguest-${OS_TAG} ``` -## Step 4: Configure the Cluster with the vGPU License Information and the Driver Container Image +## Configure the Cluster with the vGPU License Information and the Driver Container Image 1. Create an NVIDIA vGPU license file named `gridd.conf` with contents like the following example: @@ -182,7 +193,7 @@ Perform the following steps to build and push a container image that includes th You need to specify the secret name `REGISTRY_SECRET_NAME` when you install the GPU Operator with Helm. -## Step 5: Install the Operator +## Install the Operator - Install the Operator: diff --git a/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md index 5d4e6d78f..76c52f193 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md @@ -1,6 +1,19 @@ --- name: "gpu-operator-install-outdated-kernels" -description: "Explains how to install the GPU Operator when nodes run outdated kernels. Use when driver containers fail because kernel versions are older than supported defaults. Trigger keywords - NVIDIA GPU Operator, outdated kernels, driver containers, installation." +description: "Explains how to install the GPU Operator when nodes run outdated kernels. Use when driver containers fail because kernel versions are older than supported defaults." 
+triggers: + - NVIDIA GPU Operator + - outdated kernels + - driver containers + - installation +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - kernels + - driver + - installation --- @@ -16,7 +29,7 @@ see the following error message: `Could not resolve Linux kernel version`. In general, upgrading your system to the latest kernel should fix this issue. But if this is not an option, the following is a workaround to successfully deploy the GPU Operator when GPU nodes in your cluster may not be running the latest kernel. -## Step 1: Add Archived Package Repositories +## Add Archived Package Repositories The workaround is to find the package archive containing packages for your outdated kernel and to add this repository to the package manager running inside the `driver` container. To achieve this, we can simply mount a repository list file into the `driver` container using a `ConfigMap`. @@ -87,7 +100,7 @@ Deploy GPU Operator with updated `values.yaml`: $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ -f values.yaml ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/SKILL.md index 8ca89b4d0..7e19a39ec 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-install-service-mesh" -description: "Guides users through GPU Operator service mesh considerations. Use when deploying with Istio or troubleshooting sidecar injection and service mesh interactions. Trigger keywords - NVIDIA GPU Operator, service mesh, Istio, Kubernetes." +description: "Guides users through GPU Operator service mesh considerations. Use when deploying with Istio or troubleshooting sidecar injection and service mesh interactions." 
+triggers: + - NVIDIA GPU Operator + - service mesh + - Istio + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - service-mesh + - istio --- @@ -8,7 +20,7 @@ description: "Guides users through GPU Operator service mesh considerations. Use # Install GPU Operator with Service Mesh -## Step 1: Special Considerations for Service Meshes +## Special Considerations for Service Meshes You can use NVIDIA GPU Operator in a cluster that uses a service mesh provided by Istio CNI or Linkerd CNI. @@ -26,7 +38,7 @@ Refer to the following documentation for more information: - [Overriding injection](https://linkerd.io/2.14/features/proxy-injection/#overriding-injection) in the Linkerd documentation. -## Step 2: Label the Namespace to Disable Injection +## Label the Namespace to Disable Injection - Label the Operator namespace to prevent automatic injection: diff --git a/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md index e28bfa3a0..28c25fe0b 100644 --- a/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-kata-containers" -description: "Guides users through configuring Kata Containers for GPU workloads with the GPU Operator. Use when deploying sandboxed GPU workloads with Kata Containers. Trigger keywords - NVIDIA GPU Operator, Kata Containers, sandboxed workloads, Kubernetes." +description: "Guides users through configuring Kata Containers for GPU workloads with the GPU Operator. Use when deploying sandboxed GPU workloads with Kata Containers." 
+triggers: + - NVIDIA GPU Operator + - Kata Containers + - sandboxed workloads + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - kata-containers + - sandboxed-workloads --- @@ -30,12 +42,11 @@ flowchart LR a[Kubelet] --> b[CRI] --> c[Kata\nRuntime] --> d[Lightweight\nQEMU VM] --> e[Lightweight\nGuest OS] --> f[Pod] --> g[Container] ``` -**Tip:** +> [!TIP] +> This page describes deploying with Kata containers only. +> Refer to the Confidential Containers documentation if you are interested in deploying Confidential Containers with Kata Containers and the GPU Operator. -This page describes deploying with Kata containers only. -Refer to the Confidential Containers documentation if you are interested in deploying Confidential Containers with Kata Containers and the GPU Operator. - -## Step 1: Benefits of Using Kata Containers +## Benefits of Using Kata Containers The primary benefits of Kata Containers are as follows: @@ -48,7 +59,7 @@ The primary benefits of Kata Containers are as follows: * Transparent deployment of unmodified containers. -## Step 2: Limitations and Restrictions +## Limitations and Restrictions * For GPU passthrough workloads, all GPUs must be assigned to one Kata Container virtual machine. Configuring only some GPUs on a node for Kata Containers is not supported. @@ -59,7 +70,7 @@ The primary benefits of Kata Containers are as follows: * NVIDIA supports the Operator and Kata Containers with the containerd runtime only. -## Step 3: Cluster Topology Considerations +## Cluster Topology Considerations You can configure all the worker nodes in your cluster for Kata Containers or you can configure some nodes for Kata Containers and others for traditional containers. Consider the following example where node A is configured to run traditional containers and node B is configured to run Kata Containers. 
@@ -70,7 +81,7 @@ Consider the following example where node A is configured to run traditional con This configuration can be controlled through node labelling, as described in the Label Nodes section. You can also set `sandboxWorkloads.defaultWorkload=vm-passthrough` when you install the GPU Operator to configure all nodes to run Kata Containers by default. -## Step 4: Configure the GPU Operator for Kata Containers +## Configure the GPU Operator for Kata Containers To enable Kata Containers for GPUs on your cluster, you do the following: @@ -113,12 +124,11 @@ After installation, you can run a sample workload that uses the Kata runtime cla Refer to the documentation for your operating system. Reboot the host after configuring the bootloader. - **Note:** - - After configuring IOMMU, you might see QEMU warnings about PCI P2P DMA when running GPU workloads. - These are expected and can be safely ignored. -* Ensure that no NVIDIA GPU drivers are installed on the host. - Kata Containers uses VFIO to pass GPUs directly to the VM, and host-level GPU drivers interfere with VFIO device binding. + > [!NOTE] + > After configuring IOMMU, you might see QEMU warnings about PCI P2P DMA when running GPU workloads. + > These are expected and can be safely ignored. + > * Ensure that no NVIDIA GPU drivers are installed on the host. + > Kata Containers uses VFIO to pass GPUs directly to the VM, and host-level GPU drivers interfere with VFIO device binding. To check if NVIDIA GPU drivers are installed, run the following command: @@ -196,10 +206,9 @@ After installation, you can run a sample workload that uses the Kata runtime cla The labeling approach is useful if you want to run Kata container workloads on some nodes and traditional GPU container workloads on other nodes in your cluster. Refer to the GPU Operator Cluster Topology Considerations section for more details on what gets deployed to a Kata Container node. 
- **Tip:** - - Skip this section if you plan to set `sandboxWorkloads.defaultWorkload=vm-passthrough` when you install the GPU Operator. -1. Verify the node label was added: + > [!TIP] + > Skip this section if you plan to set `sandboxWorkloads.defaultWorkload=vm-passthrough` when you install the GPU Operator. + > 1. Verify the node label was added: ```console $ kubectl describe node | grep nvidia.com/gpu.workload.config @@ -248,19 +257,17 @@ The minimum required version is 3.29.0. TEST SUITE: None ``` - **Note:** - - The `--wait` flag in the install command instructs Helm to wait until the release is deployed before returning. - It can take a few minutes to return output. + > [!NOTE] + > The `--wait` flag in the install command instructs Helm to wait until the release is deployed before returning. + > It can take a few minutes to return output. There is a [known Helm issue](https://github.com/helm/helm/issues/8660) on single node clusters, that may result in the Helm command finishing before all deployed pods are finished initializing. If you are deploying to a single node cluster, you may need to wait for an additional few minutes after the Helm command completes for the `kata-deploy` pod to be in the Running state. - **Note:** - - Both `kata-deploy` and the GPU Operator deploy Node Feature Discovery (NFD) by default. - The install command includes `--set nfd.enabled=false` to prevent `kata-deploy` from deploying NFD. - The GPU Operator will deploy and manage NFD in the next step. -1. Optional: Verify that the `kata-deploy` pod is running: + > [!NOTE] + > Both `kata-deploy` and the GPU Operator deploy Node Feature Discovery (NFD) by default. + > The install command includes `--set nfd.enabled=false` to prevent `kata-deploy` from deploying NFD. + > The GPU Operator will deploy and manage NFD in the next step. + > 1. 
Optional: Verify that the `kata-deploy` pod is running: ```console $ kubectl get pods -n kata-system | grep kata-deploy @@ -292,12 +299,11 @@ The minimum required version is 3.29.0. The `kata-qemu-nvidia-gpu` runtime class is used with Kata Containers. The `kata-qemu-nvidia-gpu-snp` and `kata-qemu-nvidia-gpu-tdx` runtime classes are used to deploy Confidential Containers. - **Note:** - - To manage the lifecycle of Kata Containers, including upgrades and day-two operations, - install the [Kata Lifecycle Manager](https://github.com/kata-containers/lifecycle-manager). - This Argo Workflows-based tool is the recommended way to manage Kata Containers deployments. -1. Optional: If you have an issue deploying the `kata-deploy` pod or are not seeing the expected runtime classes, get the pod name and view the logs: + > [!NOTE] + > To manage the lifecycle of Kata Containers, including upgrades and day-two operations, + > install the [Kata Lifecycle Manager](https://github.com/kata-containers/lifecycle-manager). + > This Argo Workflows-based tool is the recommended way to manage Kata Containers deployments. + > 1. Optional: If you have an issue deploying the `kata-deploy` pod or are not seeing the expected runtime classes, get the pod name and view the logs: ```console $ kubectl get pods -n kata-system | grep kata-deploy @@ -334,7 +340,7 @@ Install the NVIDIA GPU Operator and configure it to deploy Kata Container compon $ helm install --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set sandboxWorkloads.enabled=true \ --set sandboxWorkloads.mode=kata \ --set nfd.enabled=true \ @@ -353,10 +359,9 @@ Install the NVIDIA GPU Operator and configure it to deploy Kata Container compon TEST SUITE: None ``` - **Tip:** - - Add `--set sandboxWorkloads.defaultWorkload=vm-passthrough` if every worker node should use Kata by default. -1. 
Optional: Verify that all GPU Operator pods, especially the Sandbox Device Plugin and VFIO Manager operands, are running: + > [!TIP] + > Add `--set sandboxWorkloads.defaultWorkload=vm-passthrough` if every worker node should use Kata by default. + > 1. Optional: Verify that all GPU Operator pods, especially the Sandbox Device Plugin and VFIO Manager operands, are running: ```console $ kubectl get pods -n gpu-operator @@ -376,22 +381,20 @@ Install the NVIDIA GPU Operator and configure it to deploy Kata Container compon nvidia-vfio-manager-h229x 1/1 Running 0 62s ``` - **Note:** - - It can take several minutes for all GPU Operator pods to be in the Running state. - If you are not seeing the expected output, you can view the logs for the GPU Operator pods: + > [!NOTE] + > It can take several minutes for all GPU Operator pods to be in the Running state. + > If you are not seeing the expected output, you can view the logs for the GPU Operator pods: ```console $ kubectl logs -n gpu-operator ``` Replace `` with the name of the GPU Operator pod from `kubectl get pods -n gpu-operator`. - **Note:** - - The NVIDIA Confidential Computing (CC) Manager for Kubernetes (`nvidia-cc-manager`) is deployed to all nodes configured to run Kata containers, even if you are not planning to run Confidential Containers. - This manager sets the confidential computing mode on the NVIDIA GPUs, if your GPU is capable of Confidential Computing, but will not be used if you are deploying in Kata Containers only. - Refer to Confidential Containers for more details. -1. Optional: If you have host access to the worker node, you can perform the following validation step: + > [!NOTE] + > The NVIDIA Confidential Computing (CC) Manager for Kubernetes (`nvidia-cc-manager`) is deployed to all nodes configured to run Kata containers, even if you are not planning to run Confidential Containers. 
+ > This manager sets the confidential computing mode on the NVIDIA GPUs, if your GPU is capable of Confidential Computing, but will not be used if you are deploying in Kata Containers only. + > Refer to Confidential Containers for more details. + > 1. Optional: If you have host access to the worker node, you can perform the following validation step: a. Confirm that the host uses the `vfio-pci` device driver for GPUs: @@ -431,7 +434,7 @@ The following example installs the GPU Operator with both `P_GPU_ALIAS` and `NVS $ helm install --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set sandboxWorkloads.enabled=true \ --set sandboxWorkloads.mode=kata \ --set nfd.enabled=true \ @@ -454,7 +457,7 @@ $ kubectl get node -o json | grep nvidia.com "nvidia.com/GH100_H100L_94GB": "1" ``` -## Step 5: Run a Sample Workload +## Run a Sample Workload A pod specification for a Kata container requires the following: diff --git a/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md index e57e4ae7b..721f168b2 100644 --- a/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-kubevirt" -description: "Guides users through configuring the GPU Operator for KubeVirt virtual machine workloads. Use when deploying GPU-enabled VMs or troubleshooting KubeVirt GPU passthrough. Trigger keywords - NVIDIA GPU Operator, KubeVirt, virtual machines, Kubernetes." +description: "Guides users through configuring the GPU Operator for KubeVirt virtual machine workloads. Use when deploying GPU-enabled VMs or troubleshooting KubeVirt GPU passthrough." 
+triggers: + - NVIDIA GPU Operator + - KubeVirt + - virtual machines + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - kubevirt + - virtual-machines --- @@ -67,7 +79,7 @@ To override the default GPU workload configuration, set the following value in ` * Users must manually add all passthrough GPU and vGPU resources to the `permittedDevices` list in the KubeVirt CR before assigning them to KubeVirt virtual machines. Refer to the [KubeVirt documentation](https://kubevirt.io/user-guide/compute/host-devices/#listing-permitted-devices) for more information. -## Step 1: Configure KubeVirt with the GPU Operator +## Configure KubeVirt with the GPU Operator After configuring the prerequisites, the high level workflow for using the GPU Operator with KubeVirt is as follows: @@ -108,11 +120,10 @@ The GPU Operator uses the value of the `nvidia.com/gpu.workload.config` label to Follow one of the below subsections for installing the GPU Operator, depending on whether you plan to use NVIDIA vGPU or not. -**Note:** - -The following commands set the `sandboxWorkloads.enabled` flag. -This `ClusterPolicy` flag controls whether the GPU Operator can provision GPU worker nodes for virtual machine workloads, in addition to container workloads. -This flag is disabled by default, meaning all nodes get provisioned with the same software to enable container workloads, and the `nvidia.com/gpu.workload.config` node label is not used. +> [!NOTE] +> The following commands set the `sandboxWorkloads.enabled` flag. +> This `ClusterPolicy` flag controls whether the GPU Operator can provision GPU worker nodes for virtual machine workloads, in addition to container workloads. +> This flag is disabled by default, meaning all nodes get provisioned with the same software to enable container workloads, and the `nvidia.com/gpu.workload.config` node label is not used. 
The term *sandboxing* refers to running software in a separate isolated environment, typically for added security (that is, a virtual machine). We use the term `sandbox workloads` to signify workloads that run in a virtual machine, irrespective of the virtualization technology used. @@ -124,7 +135,7 @@ Install the GPU Operator, enabling `sandboxWorkloads`: $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set sandboxWorkloads.enabled=true ``` @@ -154,7 +165,7 @@ Follow the steps provided in this section. $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set sandboxWorkloads.enabled=true \ --set vgpuManager.enabled=true \ --set vgpuManager.repository= \ @@ -338,7 +349,7 @@ spec: * `name` is a name to identify the device in the virtual machine -## Step 2: vGPU Device Configuration +## vGPU Device Configuration The vGPU Device Manager assists in creating vGPU devices on GPU worker nodes. The vGPU Device Manager allows administrators to declaratively define a set of possible vGPU device configurations they would like applied to GPUs on a node. @@ -387,11 +398,10 @@ Any existing virtual machines should be shutdown/migrated before you apply the n To apply a new configuration after GPU Operator install, update the `nvidia.com/vgpu.config` node label. -**Note:** - -On GPUs that support MIG, you have the option to select MIG-backed vGPU instances instead of time-sliced vGPU instances. -To select a MIG-backed vGPU profile, label the node with the name of the MIG-backed vGPU profile. -The following example shows how to apply a new configuration on a system with two **A10** GPUs. +> [!NOTE] +> On GPUs that support MIG, you have the option to select MIG-backed vGPU instances instead of time-sliced vGPU instances. 
+> To select a MIG-backed vGPU profile, label the node with the name of the MIG-backed vGPU profile. +> The following example shows how to apply a new configuration on a system with two **A10** GPUs. ```console $ nvidia-smi -L @@ -436,13 +446,12 @@ $ kubectl get node cnt-server-2 -o json | jq '.status.allocatable | with_entries } ``` -## Step 3: Building the NVIDIA vGPU Manager image +## Building the NVIDIA vGPU Manager image -**Note:** - -Building the NVIDIA vGPU Manager image is only required if you are planning to use NVIDIA vGPU. -If only planning to use PCI passthrough, skip this section. -This section covers building the NVIDIA vGPU Manager container image and pushing it to a private registry. +> [!NOTE] +> Building the NVIDIA vGPU Manager image is only required if you are planning to use NVIDIA vGPU. +> If only planning to use PCI passthrough, skip this section. +> This section covers building the NVIDIA vGPU Manager container image and pushing it to a private registry. Download the vGPU Software from the [NVIDIA Licensing Portal](https://stg.ui.licensing.nvidia.com/). @@ -468,9 +477,8 @@ $ cd gpu-driver-container $ cp /*-vgpu-kvm.run vgpu-manager/ubuntu22.04/ ``` -**Note:** - -For Red Hat OpenShift, use a directory that includes `rhel` in the directory name. For example, `vgpu-manager/rhel8`. +> [!NOTE] +> For Red Hat OpenShift, use a directory that includes `rhel` in the directory name. For example, `vgpu-manager/rhel8`. 
| Set the following environment variables: | `PRIVATE_REGISTRY` - name of private registry used to store driver image | `VGPU_HOST_DRIVER_VERSION` - NVIDIA vGPU Manager version downloaded from NVIDIA Software Portal diff --git a/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md index dc27809af..5be2e2e4e 100644 --- a/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-multiinstance" -description: "Explains MIG strategies, labels, and configuration with the GPU Operator. Use when partitioning GPUs, enabling MIG, or troubleshooting MIG resource exposure. Trigger keywords - NVIDIA GPU Operator, MIG, Multi-Instance GPU, GPU partitioning." +description: "Explains MIG strategies, labels, and configuration with the GPU Operator. Use when partitioning GPUs, enabling MIG, or troubleshooting MIG resource exposure." +triggers: + - NVIDIA GPU Operator + - MIG + - Multi-Instance GPU + - GPU partitioning +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - mig + - gpu-partitioning --- @@ -18,7 +30,7 @@ You must enable MIG during installation by choosing a MIG strategy before you ca Refer to the architecture section for more information about how MIG is implemented in the GPU Operator. -## Step 1: Enabling MIG During Installation +## Enabling MIG During Installation Use the following steps to enable MIG and deploy MIG Manager. @@ -28,7 +40,7 @@ Use the following steps to enable MIG and deploy MIG Manager. $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set mig.strategy=single ``` @@ -48,12 +60,11 @@ Use the following steps to enable MIG and deploy MIG Manager. 
After several minutes, all GPU Operator pods, including the `nvidia-mig-manager` are deployed on nodes that have MIG capable GPUs. - **Note:** - - MIG Manager requires that no user workloads are running on the GPUs being configured. - In some cases, the node might need to be rebooted, such as a CSP, so the node might need to be cordoned - before changing the MIG mode or the MIG geometry on the GPUs. -1. Optional: Display the pods in the Operator namespace: + > [!NOTE] + > MIG Manager requires that no user workloads are running on the GPUs being configured. + > In some cases, the node might need to be rebooted, such as a CSP, so the node might need to be cordoned + > before changing the MIG mode or the MIG geometry on the GPUs. + > 1. Optional: Display the pods in the Operator namespace: ```console $ kubectl get pods -n gpu-operator @@ -69,7 +80,7 @@ Use the following steps to enable MIG and deploy MIG Manager. *Partial Output* -## Step 2: Configuring MIG Profiles +## Configuring MIG Profiles When MIG is enabled, nodes are labeled with `nvidia.com/mig.config: all-disabled` by default. To use a profile on a node, update the label value with the desired profile, for example, `nvidia.com/mig.config=all-1g.10gb`. @@ -84,9 +95,8 @@ If you need custom profiles, you can use a custom MIG configuration instead of t You can use the Helm chart to create a ConfigMap from values at install time, or create and reference your own ConfigMap. For an example, refer to dynamically-creating-the-mig-configuration-configmap. -**Note:** - -Generated MIG configuration might not be available on older drivers, such as 535 branch GPU drivers, as they do not support querying MIG profiles when MIG mode is disabled. In those cases, the GPU Operator will use a [static Configmap](https://github.com/NVIDIA/gpu-operator/blob/main/assets/state-mig-manager/0400_configmap.yaml), `default-mig-parted-config`, for MIG profiles. 
+> [!NOTE] +> Generated MIG configuration might not be available on older drivers, such as 535 branch GPU drivers, as they do not support querying MIG profiles when MIG mode is disabled. In those cases, the GPU Operator will use a [static Configmap](https://github.com/NVIDIA/gpu-operator/blob/main/assets/state-mig-manager/0400_configmap.yaml), `default-mig-parted-config`, for MIG profiles. ### Example: Single MIG Strategy The following steps show how to use the single MIG strategy and configure the `1g.10gb` profile on one node. @@ -292,14 +302,13 @@ In your values.yaml file, set `migManager.config.create` to `true`, set `migMana 1. In your `values.yaml` file, add the data for the ConfigMap, like the following example: -**Note:** - -Custom ConfigMaps must contain a key named "config.yaml" -1. Install or upgrade the GPU Operator with this values file so the chart creates the ConfigMap: +> [!NOTE] +> Custom ConfigMaps must contain a key named "config.yaml" +> 1. Install or upgrade the GPU Operator with this values file so the chart creates the ConfigMap: ```console $ helm upgrade --install gpu-operator -n gpu-operator --create-namespace \ - nvidia/gpu-operator --version=${version} \ + nvidia/gpu-operator --version=v26.3.1 \ -f values.yaml ``` @@ -356,10 +365,30 @@ You can create and apply a ConfigMap yourself if the default profiles do not mee 1. Create a file, such as `custom-mig-config.yaml`, with contents like the following example: -**Note:** + ```yaml + apiVersion: v1 + kind: ConfigMap + metadata: + name: custom-mig-config + data: + config.yaml: | + version: v1 + mig-configs: + all-disabled: + - devices: all + mig-enabled: false + + five-1g-one-2g: + - devices: all + mig-enabled: true + mig-devices: + "1g.10gb": 5 + "2g.20gb": 1 + ``` -Custom ConfigMaps must contain a key named "config.yaml" -1. Apply the manifest: +> [!NOTE] +> Custom ConfigMaps must contain a key named "config.yaml" +> 1. 
Apply the manifest: ```console $ kubectl apply -n gpu-operator -f custom-mig-config.yaml @@ -387,9 +416,9 @@ Custom ConfigMaps must contain a key named "config.yaml" $ kubectl label nodes nvidia.com/mig.config=five-1g-one-2g --overwrite ``` -## Step 3: Verification: Running Sample CUDA Workloads +## Verification: Running Sample CUDA Workloads -## Step 4: Disabling MIG +## Disabling MIG You can disable MIG on a node by setting the `nvidia.com/mig.config` label to `all-disabled`: @@ -397,7 +426,7 @@ You can disable MIG on a node by setting the `nvidia.com/mig.config` label to `a $ kubectl label nodes nvidia.com/mig.config=all-disabled --overwrite ``` -## Step 5: MIG Manager with Preinstalled Drivers +## MIG Manager with Preinstalled Drivers MIG Manager supports preinstalled drivers. Information in the preceding sections still applies, however there are a few additional details to consider. @@ -411,7 +440,7 @@ can be used to install the GPU Operator: $ helm install gpu-operator \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set driver.enabled=false ``` @@ -460,12 +489,12 @@ Alternatively, you can create a custom ConfigMap for use by MIG Manager by perfo $ helm install gpu-operator \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set migManager.gpuClientsConfig.name=gpu-clients \ --set driver.enabled=false ``` -## Step 6: Architecture +## Architecture MIG Manager is designed as a controller within Kubernetes. It watches for changes to the `nvidia.com/mig.config` label on the node and then applies the user-requested MIG configuration. 
diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md index b1688ab5f..6a581f899 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md @@ -1,6 +1,19 @@ --- name: "gpu-operator-nvidia-amazon" -description: "Guides users through installing and configuring the NVIDIA GPU Operator on Amazon EKS. Use when deploying GPU workloads on AWS or troubleshooting EKS-specific GPU Operator setup. Trigger keywords - NVIDIA GPU Operator, Amazon EKS, AWS, Kubernetes, installation." +description: "Guides users through installing and configuring the NVIDIA GPU Operator on Amazon EKS. Use when deploying GPU workloads on AWS or troubleshooting EKS-specific GPU Operator setup." +triggers: + - NVIDIA GPU Operator + - Amazon EKS + - AWS + - Kubernetes + - installation +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - aws + - eks --- @@ -8,7 +21,7 @@ description: "Guides users through installing and configuring the NVIDIA GPU Ope # NVIDIA GPU Operator with Amazon EKS -## Step 1: Approaches for Working with Amazon EKS +## Approaches for Working with Amazon EKS You can approach running workloads in Amazon EKS with NVIDIA GPUs in at least two ways. @@ -94,7 +107,7 @@ without any limitations, you perform the following high-level actions: * Use your preferred client application to create the node group. -## Step 2: Example: Create a Self-Managed Node Group with eksctl +## Example: Create a Self-Managed Node Group with eksctl ### Prerequisites @@ -115,15 +128,39 @@ The steps create a self-managed node group that uses an Amazon EKS optimized AMI 1. 
Create a file, such as `cluster-config.yaml`, with contents like the following example: + ```yaml + apiVersion: eksctl.io/v1alpha5 + kind: ClusterConfig + metadata: + name: demo-cluster + region: us-west-2 + version: "1.25" + nodeGroups: + - name: demo-gpu-workers + instanceType: g4dn.xlarge + ami: ami-0770ab88ec35aa875 + amiFamily: Ubuntu2004 + minSize: 1 + desiredCapacity: 3 + maxSize: 3 + volumeSize: 100 + overrideBootstrapCommand: | + #!/bin/bash + source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh + /etc/eks/bootstrap.sh ${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=${NODE_LABELS}" + ssh: + allow: true + publicKeyPath: ~/.ssh/id_rsa.pub + ``` + Replace the values for the cluster name, Kubernetes version, and so on. To resolve the environment variables in the override bootstrap command, you must source the bootstrap helper script. - **Tip:** - - The default volume size for each node is 20 GB. - In many cases, containers with frameworks for AI/ML workloads are often very large. - The sample YAML file specifies a 100 GB volume to ensure enough local disk space for containers. -1. Create the Amazon EKS cluster with the node group: + > [!TIP] + > The default volume size for each node is 20 GB. + > In many cases, containers with frameworks for AI/ML workloads are often very large. + > The sample YAML file specifies a 100 GB volume to ensure enough local disk space for containers. + > 1. 
Create the Amazon EKS cluster with the node group: ```console $ eksctl create cluster -f cluster-config.yaml @@ -155,7 +192,7 @@ The steps create a self-managed node group that uses an Amazon EKS optimized AMI demo-cluster us-west-2 True ``` -## Step 3: Related Information +## Related Information * The preceding procedure is derived from [Getting started with Amazon EKS - eksctl](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html) diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md index 08d7a39d7..fc9a8925b 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-nvidia-azure" -description: "Guides users through installing and configuring the NVIDIA GPU Operator on Azure AKS. Use when deploying GPU workloads on Azure or troubleshooting AKS-specific GPU Operator setup. Trigger keywords - NVIDIA GPU Operator, Azure AKS, Microsoft Azure, Kubernetes." +description: "Guides users through installing and configuring the NVIDIA GPU Operator on Azure AKS. Use when deploying GPU workloads on Azure or troubleshooting AKS-specific GPU Operator setup." +triggers: + - NVIDIA GPU Operator + - Azure AKS + - Microsoft Azure + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - azure + - aks --- @@ -8,7 +20,7 @@ description: "Guides users through installing and configuring the NVIDIA GPU Ope # NVIDIA GPU Operator with Azure Kubernetes Service -## Step 1: Approaches for Working with Azure AKS +## Approaches for Working with Azure AKS ### Create AKS Cluster with a Node Pool to Skip GPU Driver installation @@ -62,7 +74,7 @@ manage the lifecycle of these software components and others. However, using the Operator can overcome the limitations identified in the preceding section. 
-## Step 2: Installing the Operator for Preinstalled Driver and Toolkit +## Installing the Operator for Preinstalled Driver and Toolkit After you start your Azure AKS cluster with an image that includes a preinstalled NVIDIA GPU Driver and NVIDIA Container Toolkit, you are ready to install the NVIDIA GPU Operator. @@ -82,7 +94,7 @@ deploying NVIDIA Driver Containers and the NVIDIA Container Toolkit. ```console $ helm install gpu-operator nvidia/gpu-operator \ -n gpu-operator --create-namespace \ - --version=${version} \ + --version=v26.3.1 \ --set driver.enabled=false \ --set toolkit.enabled=false \ --set operator.runtimeClass=nvidia-container-runtime diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md index 7ec843abf..ecc992c75 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md @@ -1,6 +1,19 @@ --- name: "gpu-operator-nvidia-dra" -description: "Explains how to install and use the NVIDIA DRA Driver for GPUs. Use when users ask about Dynamic Resource Allocation, DRA installation, or GPU resource claims. Trigger keywords - NVIDIA GPU Operator, DRA, Dynamic Resource Allocation, Kubernetes, installation." +description: "Explains how to install and use the NVIDIA DRA Driver for GPUs. Use when users ask about Dynamic Resource Allocation, DRA installation, or GPU resource claims." +triggers: + - NVIDIA GPU Operator + - DRA + - Dynamic Resource Allocation + - Kubernetes + - installation +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - dra + - dynamic-resource-allocation --- @@ -8,8 +21,7 @@ description: "Explains how to install and use the NVIDIA DRA Driver for GPUs. Us # Prerequisites -**Tip:** - +> [!TIP] # NVIDIA DRA Driver for GPUs Dynamic Resource Allocation (DRA) is a Kubernetes concept for flexibly requesting, configuring, and sharing specialized devices like GPUs. 
@@ -38,7 +50,7 @@ You can use the NVIDIA DRA Driver for GPUs with the NVIDIA GPU Operator to deplo * For A100 GPUs, the MIG manager does not automatically evict the DRA kubelet plugin during MIG configuration changes. If the DRA kubelet plugin is deployed before a MIG change, then you must manually restart the DRA kubelet plugin. -## Step 1: Install the NVIDIA GPU Operator +## Install the NVIDIA GPU Operator ### GPU Allocation @@ -59,7 +71,7 @@ You can use the NVIDIA DRA Driver for GPUs with the NVIDIA GPU Operator to deplo ```console helm upgrade --install gpu-operator nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --create-namespace \ --namespace gpu-operator \ --set devicePlugin.enabled=false \ @@ -81,7 +93,7 @@ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ ```console helm upgrade --install gpu-operator nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --create-namespace \ --namespace gpu-operator ``` @@ -90,23 +102,21 @@ Refer to the [GPU Operator installation guide](https://docs.nvidia.com/datacente If you are planning to use MIG devices, refer to the [NVIDIA GPU Operator MIG documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html) to configure your cluster for MIG support. -## Step 2: Install DRA Driver for GPUs - -**Note:** +## Install DRA Driver for GPUs -The `gpuResourcesEnabledOverride=true` is an additional flag that is required to fully enable GPU allocation support. -Include it in the Helm command if you want to enable GPU allocation support. +> [!NOTE] +> The `gpuResourcesEnabledOverride=true` is an additional flag that is required to fully enable GPU allocation support. +> Include it in the Helm command if you want to enable GPU allocation support. If you want to disable either functionality: * To disable GPU allocation support, include `--set resources.gpus.enabled=false` in the Helm command. 
* To disable ComputeDomain support, include `--set resources.computeDomains.enabled=false` in the Helm command. -**Note:** - -The `nvidiaDriverRoot` flag sets the root directory for the NVIDIA GPU driver. -The default value is `/`, which is the typical value for drivers installed directly on the host. -If you are using GPU Operator managed drivers (default), the drivers are installed to `/run/nvidia/driver` by default. -If you are using [pre-installed drivers](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#pre-installed-nvidia-gpu-drivers), you can remove the `nvidiaDriverRoot` flag or set it to `/` in the command above. +> [!NOTE] +> The `nvidiaDriverRoot` flag sets the root directory for the NVIDIA GPU driver. +> The default value is `/`, which is the typical value for drivers installed directly on the host. +> If you are using GPU Operator managed drivers (default), the drivers are installed to `/run/nvidia/driver` by default. +> If you are using [pre-installed drivers](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#pre-installed-nvidia-gpu-drivers), you can remove the `nvidiaDriverRoot` flag or set it to `/` in the command above. ### GPU Allocation 1. Create a custom `values.yaml` file for installing the DRA driver helm chart. @@ -210,7 +220,7 @@ If you are using [pre-installed drivers](https://docs.nvidia.com/datacenter/clou --set resources.gpus.enabled=false ``` -## Step 3: Validate Installation +## Validate Installation 1. 
Confirm that the DRA driver components are running: @@ -250,7 +260,7 @@ Additional validation steps are available in the DRA Driver repository documenta * [Validate setup for ComputeDomain allocation](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Validate-setup-for-ComputeDomain-allocation) * [Validate setup for GPU allocation](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Validate-setup-for-GPU-allocation) -## Step 4: Enable Health Checks +## Enable Health Checks The NVIDIA DRA driver supports GPU health monitoring using the [NVIDIA Management Library (NVML)](https://developer.nvidia.com/management-library-nvml). This feature uses NVML to check for [GPU XID errors](https://docs.nvidia.com/deploy/xid-errors/introduction.html) and determines if a GPU or MIG device is functioning properly. @@ -272,10 +282,9 @@ helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \ --set featureGates.NVMLDeviceHealthCheck=true ``` -**Note:** - -Unhealthy GPUs will not appear in the ResourceSlice list. After the device recovers and is marked healthy again, you must restart the DRA Driver for the device to be added back into the available resources pool. -After enabling health checks, you can monitor health status in the kubelet logs. +> [!NOTE] +> Unhealthy GPUs will not appear in the ResourceSlice list. After the device recovers and is marked healthy again, you must restart the DRA Driver for the device to be added back into the available resources pool. +> After enabling health checks, you can monitor health status in the kubelet logs. 1. Check kubelet plugin logs. Health status changes are logged in the kubelet plugin container. Run `kubectl get pods -n nvidia-dra-driver-gpu` and find the `nvidia-dra-driver-gpu-kubelet-plugin-` pod name. Replace `` with your actual pod name. @@ -300,7 +309,7 @@ After enabling health checks, you can monitor health status in the kubelet logs. 
kubectl get resourceslice -o yaml ``` -## Step 5: Additional Documentation +## Additional Documentation Refer to the [DRA Driver for GPUs repository](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki) for additional documentation, including diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md index e71d298b2..4b7116a0c 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-nvidia-driver" -description: "Explains how to configure NVIDIA GPU Driver custom resources for driver lifecycle management. Use when users need custom driver configuration or mixed operating system support. Trigger keywords - NVIDIA GPU Operator, GPU driver, custom resource, driver configuration." +description: "Explains how to configure NVIDIA GPU Driver custom resources for driver lifecycle management. Use when users need custom driver configuration or mixed operating system support." +triggers: + - NVIDIA GPU Operator + - GPU driver + - custom resource + - driver configuration +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - driver + - custom-resource --- @@ -64,11 +76,10 @@ argument when you install the Operator with Helm. If the Operator is already installed with the default custom resource and you want to create your own driver custom resources and apply them to specific nodes, delete the default custom resource. -**Note:** - -After you delete the default custom resource, your custom resources might not reconcile -automatically due to a known issue. Refer to the v26.3.0 known issues -for the workaround. +> [!NOTE] +> After you delete the default custom resource, your custom resources might not reconcile +> automatically due to a known issue. Refer to the v26.3.0 known issues +> for the workaround. 
### Feature Compatibility Driver type @@ -128,7 +139,7 @@ The following table describes some of the fields in the custom resource. | `usePrecompiled` | When set to `true`, the Operator deploys a driver container image with a precompiled driver. | `false` | | | | | `version` | Specifies the GPU driver version to install. For a data-center driver, specify a value like `580.126.20`. If you set `usePrecompiled` to `true`, specify the driver branch, such as `580`. | Refer to the operator-component-matrix. | | | | -## Step 1: Installing the NVIDIA GPU Operator +## Installing the NVIDIA GPU Operator Perform the following steps to install the GPU Operator and use the NVIDIA driver custom resources. @@ -160,7 +171,7 @@ Perform the following steps to install the GPU Operator and use the NVIDIA drive $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set driver.nvidiaDriverCRD.enabled=true ``` @@ -170,7 +181,7 @@ Perform the following steps to install the GPU Operator and use the NVIDIA drive 1. Apply NVIDIA driver custom resources manifests to install the NVIDIA GPU driver version, type, and so on for your nodes. Refer to the sample manifests. -## Step 2: Sample NVIDIA Driver Manifests +## Sample NVIDIA Driver Manifests ### One Driver Type and Version on All Nodes @@ -178,6 +189,45 @@ Perform the following steps to install the GPU Operator and use the NVIDIA drive 1. Create a file, such as `nvd-all.yaml`, with contents like the following: + ```yaml + apiVersion: nvidia.com/v1alpha1 + kind: NVIDIADriver + metadata: + name: nvidiadriver-sample + spec: + # use pre-compiled packages for NVIDIA driver installation. 
+ usePrecompiled: false + driverType: gpu + repository: nvcr.io/nvidia + image: driver + version: "580.126.20" + imagePullPolicy: IfNotPresent + imagePullSecrets: [] + nodeSelector: {} + manager: {} + rdma: + enabled: false + useHostMofed: false + gds: + enabled: false + # Private mirror repository configuration + repoConfig: + name: "" + # custom ssl key/certificate configuration + certConfig: + name: "" + # vGPU licensing configuration + licensingConfig: + secretName: "" + nlsEnabled: true + # vGPU topology daemon configuration + virtualTopologyConfig: + name: "" + # kernel module configuration for NVIDIA driver + kernelModuleConfig: + name: "" + ``` + 1. Apply the manifest: ```console @@ -208,6 +258,40 @@ Perform the following steps to install the GPU Operator and use the NVIDIA drive 1. Create a file, such as `nvd-driver-multiple.yaml`, with contents like the following: + ```yaml + apiVersion: nvidia.com/v1alpha1 + kind: NVIDIADriver + metadata: + name: demo-gold + spec: + driverType: gpu + env: [] + image: driver + imagePullPolicy: IfNotPresent + imagePullSecrets: [] + manager: {} + nodeSelector: + driver.config: "gold" + repository: nvcr.io/nvidia + version: "580.126.20" + --- + apiVersion: nvidia.com/v1alpha1 + kind: NVIDIADriver + metadata: + name: demo-silver + spec: + driverType: gpu + env: [] + image: driver + imagePullPolicy: IfNotPresent + imagePullSecrets: [] + manager: {} + nodeSelector: + driver.config: "silver" + repository: nvcr.io/nvidia + version: "470.141.10" + ``` + 1. Apply the manifest: ```console @@ -226,11 +310,29 @@ Perform the following steps to install the GPU Operator and use the NVIDIA drive 1. 
Create a file, such as `nvd-precompiled-all.yaml`, with contents like the following: - **Tip:** + ```yaml + apiVersion: nvidia.com/v1alpha1 + kind: NVIDIADriver + metadata: + name: demo-precomp-all + spec: + driverType: gpu + env: [] + image: driver + imagePullPolicy: IfNotPresent + imagePullSecrets: [] + manager: {} + nodeSelector: {} + repository: nvcr.io/nvidia + resources: {} + usePrecompiled: true + version: "580" + ``` - Because the manifest does not include a `nodeSelector` field, the driver custom - resource selects all nodes in the cluster that have an NVIDIA GPU. -1. Apply the manifest: + > [!TIP] + > Because the manifest does not include a `nodeSelector` field, the driver custom + > resource selects all nodes in the cluster that have an NVIDIA GPU. + > 1. Apply the manifest: ```console $ kubectl apply -n gpu-operator -f nvd-precompiled-all.yaml @@ -251,7 +353,28 @@ Perform the following steps to install the GPU Operator and use the NVIDIA drive $ kubectl label node --overwrite driver.version="580" ``` -1. Create a file, such as `nvd-precomiled-some.yaml`, with contents like the following: +1. Create a file, such as `nvd-precompiled-some.yaml`, with contents like the following: + + ```yaml + apiVersion: nvidia.com/v1alpha1 + kind: NVIDIADriver + metadata: + name: demo-precomp + spec: + driverType: gpu + env: [] + image: driver + imagePullPolicy: IfNotPresent + imagePullSecrets: [] + manager: {} + nodeSelector: + driver.precompiled: "true" + driver.version: "580" + repository: nvcr.io/nvidia + resources: {} + usePrecompiled: true + version: "580" + ``` 1. Apply the manifest: @@ -265,7 +388,7 @@ Perform the following steps to install the GPU Operator and use the NVIDIA drive $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' ``` -## Step 3: Upgrading the NVIDIA GPU Driver +## Upgrading the NVIDIA GPU Driver You can upgrade the driver version by editing or patching the NVIDIA driver custom resource. 
diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md index 3f7b89d4c..0902207a4 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-nvidia-google" -description: "Guides users through installing and configuring the NVIDIA GPU Operator on Google GKE. Use when deploying GPU workloads on GKE or troubleshooting GKE-specific GPU Operator setup. Trigger keywords - NVIDIA GPU Operator, Google GKE, Kubernetes, installation." +description: "Guides users through installing and configuring the NVIDIA GPU Operator on Google GKE. Use when deploying GPU workloads on GKE or troubleshooting GKE-specific GPU Operator setup." +triggers: + - NVIDIA GPU Operator + - Google GKE + - Kubernetes + - installation +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - google-cloud + - gke --- @@ -29,7 +41,7 @@ The preceding information relates to using GKE Standard node pools. For Autopilot Pods, using the GPU Operator is not supported, and you can refer to [Deploy GPU workloads in Autopilot](https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-gpus). -## Step 1: Using the Google Driver Installer +## Using the Google Driver Installer Perform the following steps to create a GKE cluster with the `gcloud` CLI and use Google driver installer to manage the GPU driver. You can create a node pool that uses a Container-Optimized OS node image or a Ubuntu node image. @@ -68,6 +80,23 @@ You can create a node pool that uses a Container-Optimized OS node image or a Ub 1. 
Create a file, such as `gpu-operator-quota.yaml`, with contents like the following example: + ```yaml + apiVersion: v1 + kind: ResourceQuota + metadata: + name: gpu-operator-quota + spec: + hard: + pods: 100 + scopeSelector: + matchExpressions: + - operator: In + scopeName: PriorityClass + values: + - system-node-critical + - system-cluster-critical + ``` + 1. Apply the resource quota: ```console @@ -106,7 +135,7 @@ You can create a node pool that uses a Container-Optimized OS node image or a Ub $ helm install --wait --generate-name \ -n gpu-operator \ nvidia/gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \ --set toolkit.installDir=/home/kubernetes/bin/nvidia \ --set cdi.enabled=true \ @@ -124,7 +153,7 @@ You can create a node pool that uses a Container-Optimized OS node image or a Ub --set-string migManager.env[0].value=true ``` -## Step 2: Using NVIDIA Driver Manager +## Using NVIDIA Driver Manager Perform the following steps to create a GKE cluster with the `gcloud` CLI and use the Operator and NVIDIA Driver Manager to manage the GPU driver. The steps create the cluster with a node pool that uses a Ubuntu and containerd node image. @@ -177,6 +206,23 @@ The steps create the cluster with a node pool that uses a Ubuntu and containerd 1. Create a file, such as `gpu-operator-quota.yaml`, with contents like the following example: + ```yaml + apiVersion: v1 + kind: ResourceQuota + metadata: + name: gpu-operator-quota + spec: + hard: + pods: 100 + scopeSelector: + matchExpressions: + - operator: In + scopeName: PriorityClass + values: + - system-node-critical + - system-cluster-critical + ``` + 1. Apply the resource quota: ```console @@ -200,7 +246,7 @@ The steps create the cluster with a node pool that uses a Ubuntu and containerd 1. Install the Operator. Refer to install the NVIDIA GPU Operator. 
-## Step 3: Related Information +## Related Information * If you have an existing GKE cluster, refer to [Add and manage node pools](https://cloud.google.com/kubernetes-engine/docs/how-to/node-pools) diff --git a/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md index b0ad4008c..0a8bd6a99 100644 --- a/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-precompiled-drivers" -description: "Explains how to use precompiled NVIDIA driver containers with the GPU Operator. Use when reducing driver build time or selecting precompiled driver images. Trigger keywords - NVIDIA GPU Operator, precompiled drivers, driver containers, Kubernetes." +description: "Explains how to use precompiled NVIDIA driver containers with the GPU Operator. Use when reducing driver build time or selecting precompiled driver images." +triggers: + - NVIDIA GPU Operator + - precompiled drivers + - driver containers + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - driver + - precompiled-drivers --- @@ -25,7 +37,7 @@ with restricted internet access or sites with resource-constrained hardware. hosts with the x86_64 architecture and operating system versions listed in the supported-precompiled-drivers table. For information about using precompiled drivers with OpenShift Container Platform, - refer to :external+ocpgpu-operator-with-precompiled-drivers. + refer to [GPU Operator with precompiled drivers on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/gpu-operator-with-precompiled-drivers.html). * NVIDIA supports precompiled driver containers for the most recently released long-term servicing branch (LTSB) driver branch. @@ -36,7 +48,7 @@ with restricted internet access or sites with resource-constrained hardware. 
* Precompiled driver containers do not support NVIDIA vGPU or GPUDirect Storage (GDS). -## Step 1: Determining if a Precompiled Driver Container is Available +## Determining if a Precompiled Driver Container is Available The precompiled driver containers are named according to the following pattern: @@ -79,7 +91,7 @@ Use one of the following ways to check if a driver container is available for yo ... ``` -## Step 2: Enabling Precompiled Driver Container Support During Installation +## Enabling Precompiled Driver Container Support During Installation Refer to the common instructions for installing the Operator with Helm at install-gpu-operator. Specify the `--set driver.usePrecompiled=true` and `--set driver.version=` arguments like the following example command: @@ -88,7 +100,7 @@ Specify the `--set driver.usePrecompiled=true` and `--set driver.version=`. Refer to Common Chart Customization Options for information about other installation options. -## Step 3: Enabling Support After Installation +## Enabling Support After Installation Perform the following steps to enable support for precompiled driver containers: @@ -136,7 +148,7 @@ Perform the following steps to enable support for precompiled driver containers: Ensure that the pod names include a Linux kernel semantic version number like `5.15.0-69-generic`. -## Step 4: Disabling Support for Precompiled Driver Containers +## Disabling Support for Precompiled Driver Containers Perform the following steps to disable support for precompiled driver containers: @@ -166,14 +178,13 @@ Perform the following steps to disable support for precompiled driver containers Ensure that the pod names do not include a Linux kernel semantic version number. -## Step 5: Building a Custom Driver Container Image +## Building a Custom Driver Container Image If a precompiled driver container for your Linux kernel variant is not available, you can perform the following steps to build and run a container image. 
-**Note:** - -NVIDIA provides limited support for custom driver container images. +> [!NOTE] +> NVIDIA provides limited support for custom driver container images. ### Prerequisites * You have access to a private container registry, such as NVIDIA NGC Private Registry, and can push container images to the registry. * Your build machine has access to the internet to download operating system packages. diff --git a/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md index bb8074d68..061d15e28 100644 --- a/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-timeslicing-gpus" -description: "Explains GPU sharing and time-slicing configuration. Use when users need multiple workloads to share GPUs or need to configure time-sliced GPU resources. Trigger keywords - NVIDIA GPU Operator, GPU sharing, time-slicing, Kubernetes." +description: "Explains GPU sharing and time-slicing configuration. Use when users need multiple workloads to share GPUs or need to configure time-sliced GPU resources." +triggers: + - NVIDIA GPU Operator + - GPU sharing + - time-slicing + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - gpu-sharing + - time-slicing --- @@ -8,7 +20,7 @@ description: "Explains GPU sharing and time-slicing configuration. Use when user # Time-Slicing GPUs in Kubernetes -## Step 1: Understanding Time-Slicing GPUs +## Understanding Time-Slicing GPUs The NVIDIA GPU Operator enables oversubscription of GPUs through a set of extended options for the [NVIDIA Kubernetes Device Plugin](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/k8s-device-plugin). @@ -24,12 +36,11 @@ than not being able to share at all. Internally, GPU time-slicing is used to multiplex workloads from replicas of the same underlying GPU. 
-**Note:** - -A typical resource request provides exclusive access to GPUs. -A request for a time-sliced GPU provides shared access. -A request for more than one time-sliced GPU does not guarantee that the pod -receives access to a proportional amount of GPU compute power. +> [!NOTE] +> A typical resource request provides exclusive access to GPUs. +> A request for a time-sliced GPU provides shared access. +> A request for more than one time-sliced GPU does not guarantee that the pod +> receives access to a proportional amount of GPU compute power. A request for more than one time-sliced GPU only specifies that the pod receives access to a GPU that is shared by other pods. @@ -120,7 +131,7 @@ nvidia.com/gpu.product = A100-SXM4-40GB-MIG-1g.5gb-SHARED If you set `renameByDefault=true`, then the value of the `nvidia.com/gpu.product` node label is not modified. -## Step 2: Configuration +## Configuration ### About Configuring GPU Time-Slicing @@ -153,6 +164,23 @@ and want to apply the same time-slicing configuration on all nodes in the cluste 1. Create a file, such as `time-slicing-config-all.yaml`, with contents like the following example: + ```yaml + apiVersion: v1 + kind: ConfigMap + metadata: + name: time-slicing-config-all + data: + any: |- + version: v1 + flags: + migStrategy: none + sharing: + timeSlicing: + resources: + - name: nvidia.com/gpu + replicas: 4 + ``` + 1. Add the config map to the same namespace as the GPU operator: ```console @@ -186,6 +214,40 @@ control which configuration is applied to which nodes. 1. 
Create a file, such as `time-slicing-config-fine.yaml`, with contents like the following example: + ```yaml + apiVersion: v1 + kind: ConfigMap + metadata: + name: time-slicing-config-fine + data: + a100-40gb: |- + version: v1 + flags: + migStrategy: mixed + sharing: + timeSlicing: + resources: + - name: nvidia.com/gpu + replicas: 8 + - name: nvidia.com/mig-1g.5gb + replicas: 2 + - name: nvidia.com/mig-2g.10gb + replicas: 2 + - name: nvidia.com/mig-3g.20gb + replicas: 3 + - name: nvidia.com/mig-7g.40gb + replicas: 7 + tesla-t4: |- + version: v1 + flags: + migStrategy: none + sharing: + timeSlicing: + resources: + - name: nvidia.com/gpu + replicas: 4 + ``` + 1. Add the config map to the same namespace as the GPU operator: ```console @@ -260,7 +322,7 @@ Perform the following steps to configure time-slicing before installing the oper ```console $ helm install gpu-operator nvidia/gpu-operator \ -n gpu-operator \ - --version=${version} \ + --version=v26.3.1 \ --set devicePlugin.config.name=time-slicing-config ``` @@ -285,7 +347,7 @@ $ kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemons Currently running workloads are not affected and continue to run, though NVIDIA recommends performing the restart during a maintenance period. 
-## Step 3: Verifying the GPU Time-Slicing Configuration +## Verifying the GPU Time-Slicing Configuration Perform the following steps to verify that the time-slicing configuration is applied successfully: @@ -350,6 +412,39 @@ Perform the following steps to verify that the time-slicing configuration is app * Create a file, such as `time-slicing-verification.yaml`, with contents like the following: + ```yaml + apiVersion: apps/v1 + kind: Deployment + metadata: + name: time-slicing-verification + labels: + app: time-slicing-verification + spec: + replicas: 5 + selector: + matchLabels: + app: time-slicing-verification + template: + metadata: + labels: + app: time-slicing-verification + spec: + tolerations: + - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + hostPID: true + containers: + - name: cuda-sample-vector-add + image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04" + command: ["/bin/bash", "-c", "--"] + args: + - while true; do /cuda-samples/vectorAdd; done + resources: + limits: + nvidia.com/gpu: 1 + ``` + * Create the deployment with multiple replicas: ```console @@ -384,7 +479,7 @@ Perform the following steps to verify that the time-slicing configuration is app deployment.apps "time-slicing-verification" deleted ``` -## Step 4: References +## References - [Blog post on GPU sharing in Kubernetes](https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes). - [NVIDIA Kubernetes Device Plugin](https://github.com/NVIDIA/k8s-device-plugin) repository on GitHub. 
diff --git a/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/SKILL.md index 114d95203..bf6395866 100644 --- a/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-uninstalling-nvidia" -description: "Guides users through uninstalling the NVIDIA GPU Operator and cleaning up related resources. Use when removing the Operator from a Kubernetes cluster. Trigger keywords - NVIDIA GPU Operator, uninstall, removal, Kubernetes." +description: "Guides users through uninstalling the NVIDIA GPU Operator and cleaning up related resources. Use when removing the Operator from a Kubernetes cluster." +triggers: + - NVIDIA GPU Operator + - uninstall + - removal + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - uninstall + - cleanup --- @@ -71,10 +83,9 @@ Alternatively, you can delete the custom resource definition: $ kubectl delete crd clusterpolicies.nvidia.com ``` -**Note:** - -* After uninstalling the Operator, the NVIDIA driver modules might still be loaded. - Either reboot the node or unload them using the following command: +> [!NOTE] +> * After uninstalling the Operator, the NVIDIA driver modules might still be loaded. +> Either reboot the node or unload them using the following command: ```console $ sudo rmmod nvidia_modeset nvidia_uvm nvidia diff --git a/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md index 243652de1..27c0f7b25 100644 --- a/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md @@ -1,6 +1,18 @@ --- name: "gpu-operator-upgrading-nvidia" -description: "Guides users through upgrading the NVIDIA GPU Operator with Helm and handling CRD updates. 
Use when planning or performing a GPU Operator upgrade. Trigger keywords - NVIDIA GPU Operator, upgrade, Helm, Kubernetes." +description: "Guides users through upgrading the NVIDIA GPU Operator with Helm and handling CRD updates. Use when planning or performing a GPU Operator upgrade." +triggers: + - NVIDIA GPU Operator + - upgrade + - Helm + - Kubernetes +tags: + - gpu-operator + - nvidia + - kubernetes + - gpu + - upgrade + - helm --- @@ -12,7 +24,7 @@ description: "Guides users through upgrading the NVIDIA GPU Operator with Helm a # Upgrading the NVIDIA GPU Operator -## Step 1: Using Helm +## Using Helm The GPU Operator supports dynamic updates to existing resources. This ability enables the GPU Operator to ensure settings from the cluster policy specification are always applied and current. @@ -37,7 +49,7 @@ With this procedure, all existing GPU Operator resources are updated inline and 1. Specify the Operator release tag in an environment variable: ```console - $ export RELEASE_TAG=${version} + $ export RELEASE_TAG=v26.3.1 ``` 1. Apply the custom resource definitions for the cluster policy and NVIDIA driver: @@ -119,7 +131,7 @@ Starting with GPU Operator v24.9.0, the upgrade CRD Helm hook is enabled by defa 1. Specify the Operator release tag in an environment variable: ```console - $ export RELEASE_TAG=${version} + $ export RELEASE_TAG=v26.3.1 ``` 1. Update the information about the Operator chart: @@ -151,15 +163,14 @@ Starting with GPU Operator v24.9.0, the upgrade CRD Helm hook is enabled by defa --disable-openapi-validation -f values-$RELEASE_TAG.yaml --version $RELEASE_TAG ``` - **Note:** - - * Option `--disable-openapi-validation` is required in this case so that Helm will not try to validate if CR instance from the new chart is valid as per old CRD. - Since CR instance in the Chart is valid for the upgraded CRD, this will be compatible. 
+ > [!NOTE] + > * Option `--disable-openapi-validation` is required in this case so that Helm will not try to validate if CR instance from the new chart is valid as per old CRD. + > Since CR instance in the Chart is valid for the upgraded CRD, this will be compatible. * Helm hooks used with the GPU Operator use the operator image itself. If operator image itself cannot be pulled successfully (either due to network error or an invalid NGC registry secret in case of NVAIE), hooks will fail. In this case, chart needs to be deleted using `--no-hooks` option to avoid deletion to be hung on hook failures. -## Step 2: Cluster Policy Updates +## Cluster Policy Updates The GPU Operator also supports dynamic updates to the `ClusterPolicy` CustomResource using `kubectl`: @@ -169,11 +180,11 @@ $ kubectl edit clusterpolicy After the edits are complete, Kubernetes will automatically apply the updates to cluster. -## Step 3: Additional Controls for Driver Upgrades +## Additional Controls for Driver Upgrades While most of the GPU Operator managed daemonsets can be upgraded seamlessly, the NVIDIA driver daemonset has special considerations. Refer to GPU Driver Upgrades for more information. -## Step 4: Using Operator Lifecycle Manager (OLM) in OpenShift +## Using Operator Lifecycle Manager (OLM) in OpenShift For upgrading the GPU Operator when running in OpenShift, refer to the official OpenShift documentation on [upgrading installed operators](https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html/operators/administrator-tasks#olm-upgrading-operators). 
From 780b135861d752848f59a9353b684cb5cad1ca19 Mon Sep 17 00:00:00 2001 From: Andrew Chen Date: Mon, 1 Jun 2026 03:46:31 -0700 Subject: [PATCH 03/13] fix: repair admonition-boundary over-captures + remaining bare cross-refs - Fixed 11 cases where a numbered procedure step was swept into a GitHub alert because the flattened RST source lost the blank-line boundary (container-device, kata-containers, multiinstance, amazon, nvidia-driver). - Mapped remaining bare-text :ref: cross-refs in vgpu, multiinstance, and government-ready SKILLs to published doc links (#10). Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../skills/gpu-operator-container-device/SKILL.md | 3 ++- .../SKILL.md | 2 +- .../gpu-operator-install-nvidia-vgpu/SKILL.md | 2 +- .../skills/gpu-operator-kata-containers/SKILL.md | 15 ++++++++++----- .../skills/gpu-operator-multiinstance/SKILL.md | 13 ++++++++----- .../skills/gpu-operator-nvidia-amazon/SKILL.md | 3 ++- .../skills/gpu-operator-nvidia-driver/SKILL.md | 3 ++- 7 files changed, 26 insertions(+), 15 deletions(-) diff --git a/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md index 2eac638ab..828738eda 100644 --- a/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md @@ -107,7 +107,8 @@ disable CDI and use the legacy NVIDIA Container Toolkit stack instead with the f > [!TIP] > You can run `kubectl get nodes -o wide` and view the `CONTAINER-RUNTIME` > column to determine if your nodes use CRI-O. - > 1. Disable CDI by modifying the cluster policy: + + 1. 
Disable CDI by modifying the cluster policy: ```console $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ diff --git a/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md index 6ae82c54f..0cda32e9b 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md @@ -26,7 +26,7 @@ For more information on NVIDIA's government-ready support, refer to the white pa ## Supported GPU Operator Components -Refer to the operator-component-matrix for a full list of supported government-ready GPU Operator components. +Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) for a full list of supported government-ready GPU Operator components. Artifacts for these components are available from the [NVIDIA NGC Catalog](https://registry.ngc.nvidia.com/orgs/nvstaging/teams/cloud-native/containers/gpu-driver-stig-fips). diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md index 98ce2a8bc..84804010e 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md @@ -208,7 +208,7 @@ Perform the following steps to build and push a container image that includes th ``` The preceding command installs the Operator with the default configuration. -Refer to gpu-operator-helm-chart-options for information about configuration options. 
+Refer to the [GPU Operator Helm chart options](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options) for information about configuration options. ## Related Skills diff --git a/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md index 28c25fe0b..3beb6d1a5 100644 --- a/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md @@ -208,7 +208,8 @@ After installation, you can run a sample workload that uses the Kata runtime cla > [!TIP] > Skip this section if you plan to set `sandboxWorkloads.defaultWorkload=vm-passthrough` when you install the GPU Operator. - > 1. Verify the node label was added: + + 1. Verify the node label was added: ```console $ kubectl describe node | grep nvidia.com/gpu.workload.config @@ -267,7 +268,8 @@ The minimum required version is 3.29.0. > Both `kata-deploy` and the GPU Operator deploy Node Feature Discovery (NFD) by default. > The install command includes `--set nfd.enabled=false` to prevent `kata-deploy` from deploying NFD. > The GPU Operator will deploy and manage NFD in the next step. - > 1. Optional: Verify that the `kata-deploy` pod is running: + + 1. Optional: Verify that the `kata-deploy` pod is running: ```console $ kubectl get pods -n kata-system | grep kata-deploy @@ -303,7 +305,8 @@ The minimum required version is 3.29.0. > To manage the lifecycle of Kata Containers, including upgrades and day-two operations, > install the [Kata Lifecycle Manager](https://github.com/kata-containers/lifecycle-manager). > This Argo Workflows-based tool is the recommended way to manage Kata Containers deployments. - > 1. Optional: If you have an issue deploying the `kata-deploy` pod or are not seeing the expected runtime classes, get the pod name and view the logs: + + 1. 
Optional: If you have an issue deploying the `kata-deploy` pod or are not seeing the expected runtime classes, get the pod name and view the logs: ```console $ kubectl get pods -n kata-system | grep kata-deploy @@ -361,7 +364,8 @@ Install the NVIDIA GPU Operator and configure it to deploy Kata Container compon > [!TIP] > Add `--set sandboxWorkloads.defaultWorkload=vm-passthrough` if every worker node should use Kata by default. - > 1. Optional: Verify that all GPU Operator pods, especially the Sandbox Device Plugin and VFIO Manager operands, are running: + + 1. Optional: Verify that all GPU Operator pods, especially the Sandbox Device Plugin and VFIO Manager operands, are running: ```console $ kubectl get pods -n gpu-operator @@ -394,7 +398,8 @@ Install the NVIDIA GPU Operator and configure it to deploy Kata Container compon > The NVIDIA Confidential Computing (CC) Manager for Kubernetes (`nvidia-cc-manager`) is deployed to all nodes configured to run Kata containers, even if you are not planning to run Confidential Containers. > This manager sets the confidential computing mode on the NVIDIA GPUs, if your GPU is capable of Confidential Computing, but will not be used if you are deploying in Kata Containers only. > Refer to Confidential Containers for more details. - > 1. Optional: If you have host access to the worker node, you can perform the following validation step: + + 1. Optional: If you have host access to the worker node, you can perform the following validation step: a. 
Confirm that the host uses the `vfio-pci` device driver for GPUs: diff --git a/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md index 5be2e2e4e..930b0d479 100644 --- a/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md @@ -28,7 +28,7 @@ Refer to the [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user- GPU Operator deploys MIG Manager to manage MIG configuration on nodes in your Kubernetes cluster. You must enable MIG during installation by choosing a MIG strategy before you can configure MIG. -Refer to the architecture section for more information about how MIG is implemented in the GPU Operator. +Refer to the [Multi-Instance GPU architecture](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html) for more information about how MIG is implemented in the GPU Operator. ## Enabling MIG During Installation @@ -56,7 +56,7 @@ Use the following steps to enable MIG and deploy MIG Manager. MIG Manager supports preinstalled drivers, meaning drivers that are not managed by the GPU Operator and you installed directly on the host. If drivers are preinstalled, also specify `--set driver.enabled=false`. - Refer to mig-with-preinstalled-drivers for more details. + Refer to [MIG with pre-installed drivers](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html) for more details. After several minutes, all GPU Operator pods, including the `nvidia-mig-manager` are deployed on nodes that have MIG capable GPUs. @@ -64,7 +64,8 @@ Use the following steps to enable MIG and deploy MIG Manager. > MIG Manager requires that no user workloads are running on the GPUs being configured. 
> In some cases, the node might need to be rebooted, such as a CSP, so the node might need to be cordoned > before changing the MIG mode or the MIG geometry on the GPUs. - > 1. Optional: Display the pods in the Operator namespace: + + 1. Optional: Display the pods in the Operator namespace: ```console $ kubectl get pods -n gpu-operator @@ -304,7 +305,8 @@ In your values.yaml file, set `migManager.config.create` to `true`, set `migMana > [!NOTE] > Custom ConfigMaps must contain a key named "config.yaml" -> 1. Install or upgrade the GPU Operator with this values file so the chart creates the ConfigMap: + +1. Install or upgrade the GPU Operator with this values file so the chart creates the ConfigMap: ```console $ helm upgrade --install gpu-operator -n gpu-operator --create-namespace \ @@ -388,7 +390,8 @@ You can create and apply a ConfigMap yourself if the default profiles do not mee > [!NOTE] > Custom ConfigMaps must contain a key named "config.yaml" -> 1. Apply the manifest: + +1. Apply the manifest: ```console $ kubectl apply -n gpu-operator -f custom-mig-config.yaml diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md index 6a581f899..43f70049e 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md @@ -160,7 +160,8 @@ The steps create a self-managed node group that uses an Amazon EKS optimized AMI > The default volume size for each node is 20 GB. > In many cases, containers with frameworks for AI/ML workloads are often very large. > The sample YAML file specifies a 100 GB volume to ensure enough local disk space for containers. - > 1. Create the Amazon EKS cluster with the node group: + + 1. 
Create the Amazon EKS cluster with the node group: ```console $ eksctl create cluster -f cluster-config.yaml diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md index 4b7116a0c..69edd08ac 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md @@ -332,7 +332,8 @@ Perform the following steps to install the GPU Operator and use the NVIDIA drive > [!TIP] > Because the manifest does not include a `nodeSelector` field, the driver custom > resource selects all nodes in the cluster that have an NVIDIA GPU. - > 1. Apply the manifest: + + 1. Apply the manifest: ```console $ kubectl apply -n gpu-operator -f nvd-precompiled-all.yaml From 7ed4069bea2794dd06301a14fdac0a1555f58ec4 Mon Sep 17 00:00:00 2001 From: Andrew Chen Date: Mon, 1 Jun 2026 03:49:09 -0700 Subject: [PATCH 04/13] feat: add Prerequisites/Verification sections (batch 1) Adds systemic Prerequisites and/or Verification sections to container-device, custom-driver, driver-upgrades, nvidia-driver, nvidia-azure, and service-mesh skills (#systemic). Fixed a bare getting-started cross-ref in service-mesh. 
Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../gpu-operator-container-device/SKILL.md | 27 ++++++++++++++++++ .../gpu-operator-custom-driver/SKILL.md | 6 ++++ .../gpu-operator-driver-upgrades/SKILL.md | 6 ++++ .../SKILL.md | 22 +++++++++++++-- .../skills/gpu-operator-nvidia-azure/SKILL.md | 6 ++++ .../gpu-operator-nvidia-driver/SKILL.md | 28 +++++++++++++++++++ 6 files changed, 92 insertions(+), 3 deletions(-) diff --git a/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md index 828738eda..39b39a59d 100644 --- a/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md @@ -24,6 +24,12 @@ tags: This page gives an overview of CDI and NRI Plugin support in the GPU Operator. +## Prerequisites + +- A running Kubernetes cluster with NVIDIA GPU worker nodes. +- The NVIDIA GPU Operator installed (use the `gpu-operator-install` skill). +- A container runtime that supports CDI. CDI is enabled by default starting with GPU Operator v25.10.0. The NRI Plugin requires containerd v1.7.30, v2.1.x, or v2.2.x and is not supported with CRI-O. + ## About Container Device Interface (CDI) The [Container Device Interface (CDI)](https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md) @@ -199,3 +205,24 @@ clusterpolicy.nvidia.com/cluster-policy patched ``` After disabling the NRI Plugin, the `nvidia` runtime class will be created. + +## Verification + +Confirm that CDI or the NRI Plugin is configured as expected: + +1. Confirm the GPU Operator pods, including the container toolkit and device plugin, are running: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + The `nvidia-container-toolkit-daemonset` and `nvidia-device-plugin-daemonset` pods should report `Running`. + +1. 
Run a GPU workload and confirm the GPU is injected into the container: + + ```console + $ kubectl run cuda-check --rm -it --restart=Never \ + --image=nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04 + ``` + + A successful run reports `Test PASSED`, confirming that the device was injected through CDI or the NRI Plugin. diff --git a/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md index 59d263d88..76f64e470 100644 --- a/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md @@ -26,6 +26,12 @@ On a machine with the driver already installed, you can list the parameter names You can pass custom parameters to the kernel modules that get loaded as part of the NVIDIA Driver installation (`nvidia`, `nvidia-modeset`, `nvidia-uvm`, and `nvidia-peermem`). +## Prerequisites + +- A running Kubernetes cluster with NVIDIA GPU worker nodes. +- The NVIDIA GPU Operator installed (use the `gpu-operator-install` skill). +- The GPU Operator deploys the NVIDIA driver as a container (`driver.enabled=true`, the default). Custom kernel-module parameters do not apply when you use pre-installed host drivers. + ## Configure Custom Driver Parameters To pass custom parameters, execute the following steps. diff --git a/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/SKILL.md index 49e341c54..3b5115e6b 100644 --- a/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/SKILL.md @@ -20,6 +20,12 @@ tags: # GPU Driver Upgrades +## Prerequisites + +- A running Kubernetes cluster with NVIDIA GPU worker nodes. +- The NVIDIA GPU Operator installed (use the `gpu-operator-install` skill). +- The driver deployed as a container by the Operator (`driver.enabled=true`, the default). 
The GPU Operator only manages the lifecycle of containerized drivers; drivers pre-installed on the host are not managed by the Operator. + ## About Upgrading the GPU Driver The NVIDIA driver daemon set requires special consideration for upgrades because the driver kernel modules must be unloaded and loaded again on each driver container restart. diff --git a/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/SKILL.md index 7e19a39ec..9b09024a1 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/SKILL.md @@ -20,6 +20,12 @@ tags: # Install GPU Operator with Service Mesh +## Prerequisites + +- A running Kubernetes cluster with NVIDIA GPU worker nodes. +- A service mesh based on Istio CNI or Linkerd CNI installed in the cluster. +- The `kubectl` and `helm` CLIs available on a client machine. + ## Special Considerations for Service Meshes You can use NVIDIA GPU Operator in a cluster that uses a service mesh provided by Istio CNI or Linkerd CNI. @@ -52,6 +58,16 @@ Refer to the following documentation for more information: $ kubectl label namespace gpu-operator linkerd.io/inject=disabled ``` -If the GPU Operator is not already installed, refer to -getting-started -for information about custom options and common installation scenarios. +If the GPU Operator is not already installed, use the `gpu-operator-install` skill for information about custom options and common installation scenarios. + +## Verification + +After labeling the namespace and installing the Operator, confirm that the GPU Operator pods start successfully despite the service mesh: + +1. Confirm the Operator pods are running: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + All operands, including the `nvidia-driver-daemonset` and `nvidia-operator-validator` pods, should report `Running` or `Completed`. 
If the `k8s-driver-manager` init container is stuck, confirm that sidecar injection is disabled for the `gpu-operator` namespace. diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md index fc9a8925b..a2b0781c8 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md @@ -20,6 +20,12 @@ tags: # NVIDIA GPU Operator with Azure Kubernetes Service +## Prerequisites + +- An Azure subscription and the Azure CLI (`az`) installed and configured. +- The `kubectl` and `helm` CLIs available on a client machine. +- An AKS cluster with a GPU-enabled node pool that uses a supported operating system. Use a node pool created with `--skip-gpu-driver-install` so that the GPU Operator manages the driver lifecycle. + ## Approaches for Working with Azure AKS ### Create AKS Cluster with a Node Pool to Skip GPU Driver installation diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md index 69edd08ac..fd0738fc6 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md @@ -20,6 +20,12 @@ tags: # NVIDIA GPU Driver Custom Resource Definition +## Prerequisites + +- A running Kubernetes cluster with NVIDIA GPU worker nodes. +- The NVIDIA GPU Operator installed with the driver custom resource enabled (`--set driver.nvidiaDriverCRD.enabled=true`). Use the `gpu-operator-install` skill to install the Operator. +- This feature is recommended for new cluster installations only. You cannot use ClusterPolicy-managed drivers and the `NVIDIADriver` custom resource at the same time. 
+ ## Overview of the GPU Driver Custom Resource Definition You can create one or more instances of an NVIDIA driver (`NVIDIADriver`) custom resource @@ -418,3 +424,25 @@ When you update the custom resource, the Operator performs a rolling update of t ``` Eventually, the Operator replaces the pods that used the previous driver version with pods that use the updated driver version. + +## Verification + +Confirm that the driver custom resources are applied and the driver pods are running: + +1. List the `NVIDIADriver` custom resources and confirm their state: + + ```console + $ kubectl get nvidiadrivers + ``` + +1. Confirm the driver pods are running on the expected nodes: + + ```console + $ kubectl get pods -n gpu-operator -l app.kubernetes.io/component=nvidia-driver -o wide + ``` + + Each driver pod should report `Running`. If a pod is not progressing, inspect the events: + + ```console + $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' + ``` From ae0b54d2541a2629eba89517dfa1439e247c1db2 Mon Sep 17 00:00:00 2001 From: Andrew Chen Date: Mon, 1 Jun 2026 03:51:01 -0700 Subject: [PATCH 05/13] feat: add Prerequisites sections + fix refs (batch 2) Adds Prerequisites to outdated-kernels, airgapped, uninstalling, gpudirect-rdma, multiinstance, and timeslicing skills. Repaired the uninstalling NOTE admonition over-capture (two-bullet note + code block) and mapped timeslicing in-page section cross-refs. 
Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../gpu-operator-gpudirect-rdma/SKILL.md | 7 ++++++ .../SKILL.md | 6 +++++ .../SKILL.md | 8 ++++++ .../gpu-operator-multiinstance/SKILL.md | 6 +++++ .../gpu-operator-timeslicing-gpus/SKILL.md | 14 +++++++---- .../gpu-operator-uninstalling-nvidia/SKILL.md | 25 ++++++++++++------- 6 files changed, 52 insertions(+), 14 deletions(-) diff --git a/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md index e7818b615..d93f82738 100644 --- a/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md @@ -22,6 +22,13 @@ tags: # GPUDirect RDMA and GPUDirect Storage +## Prerequisites + +- A running Kubernetes cluster with NVIDIA GPU worker nodes. +- The NVIDIA GPU Operator installed (use the `gpu-operator-install` skill). +- NVIDIA Network Operator installed for RDMA-capable networking, and compatible RDMA-capable NICs on the GPU nodes. +- A supported NVIDIA Open GPU Kernel module driver, which is required for GPUDirect Storage. + ## About GPUDirect RDMA and GPUDirect Storage [GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html) is a technology in NVIDIA GPUs that enables direct diff --git a/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md index d2ac4ec34..0257f436b 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md @@ -21,6 +21,12 @@ tags: # Install NVIDIA GPU Operator in Air-Gapped Environments +## Prerequisites + +- A running Kubernetes cluster with NVIDIA GPU worker nodes that has restricted or no internet access. 
+- A private container registry reachable from the cluster, and a local package repository or HTTP proxy for operating-system packages. +- The `kubectl` and `helm` CLIs available on a client machine, plus a workstation with internet access for mirroring images and charts. + ## About Air-Gapped Installations This page describes how to successfully deploy the GPU Operator in clusters with restricted internet access. diff --git a/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md index 76c52f193..aa0c8ade9 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md @@ -21,6 +21,14 @@ tags: # Considerations when Installing with Outdated Kernels in Cluster +## Prerequisites + +- A running Kubernetes cluster with NVIDIA GPU worker nodes. +- The `kubectl` and `helm` CLIs available on a client machine. +- One or more GPU nodes whose running kernel is not the latest available kernel, where the `driver` container reports `Could not resolve Linux kernel version`. + +## About This Workaround + The `driver` container deployed as part of the GPU Operator requires certain packages to be available as part of the driver installation. On GPU nodes where the running kernel is not the latest, the `driver` container may fail to find the right version of these packages (e.g. kernel-headers, kernel-devel) that correspond to the running kernel version. 
In the `driver` container logs, you will most likely diff --git a/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md index 930b0d479..e2c048e9f 100644 --- a/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md @@ -20,6 +20,12 @@ tags: # GPU Operator with MIG +## Prerequisites + +- A running Kubernetes cluster with NVIDIA GPU worker nodes. +- The NVIDIA GPU Operator installed (use the `gpu-operator-install` skill). +- One or more MIG-capable NVIDIA GPUs (such as A100, A30, H100, or H200). The MIG Manager runs by default only on nodes with GPUs that support MIG. + ## About Multi-Instance GPU Multi-Instance GPU (MIG) enables GPUs based on the NVIDIA Ampere and later architectures, such as NVIDIA A100, to be partitioned into separate and secure GPU instances for CUDA applications. diff --git a/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md index 061d15e28..c85222830 100644 --- a/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md @@ -20,6 +20,12 @@ tags: # Time-Slicing GPUs in Kubernetes +## Prerequisites + +- A running Kubernetes cluster with NVIDIA GPU worker nodes. +- The NVIDIA GPU Operator installed (use the `gpu-operator-install` skill). +- NVIDIA GPUs that support time-slicing. Time-slicing shares access to a GPU among workloads without memory or fault isolation; for hardware-isolated partitioning, use MIG (use the `gpu-operator-multiinstance` skill). + ## Understanding Time-Slicing GPUs The NVIDIA GPU Operator enables oversubscription of GPUs through a set @@ -87,7 +93,7 @@ the mixed MIG strategy. 
- DCGM-Exporter does not support associating metrics to containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin. - The Operator does not monitor changes to a time-slicing config map. - Refer to time-slicing-update-config-map. + Refer to the **Updating a Time-Slicing Config Map** section. ### Changes to Node Labels @@ -308,8 +314,7 @@ Perform the following steps to configure time-slicing before installing the oper 1. Create a file, such as `time-slicing-config.yaml`, with the config map contents. - Refer to the time-slicing-cluster-wide-config or - time-slicing-node-specific-config sections. + Refer to the **Applying One Cluster-Wide Configuration** or **Applying Multiple Node-Specific Configurations** sections. 1. Add the config map to the same namespace as the GPU operator: @@ -326,8 +331,7 @@ Perform the following steps to configure time-slicing before installing the oper --set devicePlugin.config.name=time-slicing-config ``` -1. Refer to either time-slicing-cluster-wide-config or - time-slicing-node-specific-config and perform the following tasks: +1. Refer to either the **Applying One Cluster-Wide Configuration** or **Applying Multiple Node-Specific Configurations** section and perform the following tasks: * Configure the device plugin by running the `kubectl patch` command. * Apply labels to nodes if you added a config map with node-specific configurations. diff --git a/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/SKILL.md index bf6395866..0fe63edb8 100644 --- a/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/SKILL.md @@ -20,6 +20,13 @@ tags: # Uninstalling the GPU Operator +## Prerequisites + +- A Kubernetes cluster with the NVIDIA GPU Operator installed. 
+- The `kubectl` and `helm` CLIs available on a client machine, with access to the cluster and the namespace where the Operator is installed (typically `gpu-operator`). + +## Procedure + Perform the following steps to uninstall the Operator. 1. Optional: List and delete NVIDIA driver custom resources. @@ -84,13 +91,13 @@ $ kubectl delete crd clusterpolicies.nvidia.com ``` > [!NOTE] -> * After uninstalling the Operator, the NVIDIA driver modules might still be loaded. +> - After uninstalling the Operator, the NVIDIA driver modules might still be loaded. > Either reboot the node or unload them using the following command: - - ```console - $ sudo rmmod nvidia_modeset nvidia_uvm nvidia - ``` - -* Helm hooks used with the GPU Operator use the Operator image itself. - If the Operator image cannot be pulled successfully (either due to network error or an invalid NGC registry secret in case of NVAIE), hooks will fail. - In this case, delete the chart and specify the `--no-hooks` argument to avoid hanging on hook failures. +> +> ```console +> $ sudo rmmod nvidia_modeset nvidia_uvm nvidia +> ``` +> +> - Helm hooks used with the GPU Operator use the Operator image itself. +> If the Operator image cannot be pulled successfully (either due to network error or an invalid NGC registry secret in case of NVAIE), hooks will fail. +> In this case, delete the chart and specify the `--no-hooks` argument to avoid hanging on hook failures. From 88fdd263d140021d340b441d0c51dc1ac5187415 Mon Sep 17 00:00:00 2001 From: Andrew Chen Date: Mon, 1 Jun 2026 03:56:09 -0700 Subject: [PATCH 06/13] feat: complete Prerequisites/Verification pass + fix title H1 artifacts - Restored title H1s that the converter replaced with a bare # Prerequisites heading, moving prerequisites to a ## section after the real title (http-proxy, vgpu, kubevirt, dra, google, upgrading). - Restored dropped prerequisite list bodies from source RST (vgpu, kubevirt, google, upgrading). 
- Fixed the google mangled approaches table (finding #4) and stray '- name: RUNTIME_CONFIG_SOURCE' YAML fragment leak (finding #3). - Fixed http-proxy admonition over-capture + proxy_config_openshift ref. - Added Prerequisites to gov-ready, kata, precompiled, amazon, nvaie. - Added Verification sections to nvaie, vgpu, amazon, google, upgrading, service-mesh, nvidia-driver, container-device. - Mapped remaining bare cross-refs (nvaie component-matrix/install, google/upgrading skill refs). Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../SKILL.md | 8 ++++ .../gpu-operator-install-http-proxy/SKILL.md | 22 +++++----- .../SKILL.md | 30 +++++++++++++- .../gpu-operator-install-nvidia-vgpu/SKILL.md | 31 ++++++++++++-- .../gpu-operator-kata-containers/SKILL.md | 7 ++++ .../skills/gpu-operator-kubevirt/SKILL.md | 15 ++++++- .../gpu-operator-nvidia-amazon/SKILL.md | 26 ++++++++++++ .../skills/gpu-operator-nvidia-dra/SKILL.md | 14 +++++-- .../gpu-operator-nvidia-google/SKILL.md | 40 ++++++++++++++----- .../gpu-operator-precompiled-drivers/SKILL.md | 6 +++ .../gpu-operator-upgrading-nvidia/SKILL.md | 25 ++++++++++-- 11 files changed, 188 insertions(+), 36 deletions(-) diff --git a/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md index 0cda32e9b..1e1c35459 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md @@ -20,6 +20,14 @@ tags: # NVIDIA GPU Operator Government Ready +## Prerequisites + +- A running Kubernetes cluster with NVIDIA GPU worker nodes. +- The `kubectl` and `helm` CLIs available on a client machine. +- An NVIDIA AI Enterprise subscription. Government-ready components are available to NVIDIA AI Enterprise customers for FedRAMP High or equivalent sovereign use cases. 
+ +## Overview + The NVIDIA GPU Operator now offers government-ready components for NVIDIA AI Enterprise customers. Government ready is NVIDIA's designation for software that meets applicable security requirements for deployment in your FedRAMP High or equivalent sovereign use case. For more information on NVIDIA's government-ready support, refer to the white paper [AI Software for Regulated Environments](https://docs.nvidia.com/ai-enterprise/planning-resource/ai-software-regulated-environments-white-paper/latest/index.html). diff --git a/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md index 59ead3f96..2b4ee2ce5 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md @@ -18,31 +18,31 @@ tags: -# Prerequisites +# Install GPU Operator in Proxy Environments -* Kubernetes cluster is configured with HTTP proxy settings (container runtime should be enabled with HTTP proxy) +## Prerequisites -# Install GPU Operator in Proxy Environments +- A Kubernetes cluster configured with HTTP proxy settings, where the container runtime is enabled with the HTTP proxy. +- The `kubectl` and `helm` CLIs available on a client machine. ## Introduction This page describes how to successfully deploy the GPU Operator in clusters behind an HTTP proxy. By default, the GPU Operator requires internet access for the following reasons: - 1) Container images need to be pulled during GPU Operator installation. - 2) The `driver` container needs to download several OS packages prior to driver installation. +1. Container images need to be pulled during GPU Operator installation. +1. The `driver` container needs to download several OS packages prior to driver installation. 
+ + > [!TIP] + > Using precompiled drivers removes the need for the `driver` containers to download operating system packages (use the `gpu-operator-precompiled-drivers` skill). - > [!TIP] - > Using precompiled-drivers removes the need for the `driver` containers to - > download operating system packages. - > To address these requirements, all Kubernetes nodes as well as the `driver` container need proper configuration - > in order to direct traffic through the proxy. +To address these requirements, all Kubernetes nodes as well as the `driver` container need proper configuration in order to direct traffic through the proxy. This document demonstrates how to configure the GPU Operator so that the `driver` container can successfully download packages behind a HTTP proxy. Since configuring Kubernetes/container runtime components to use a proxy is not specific to the GPU Operator, we do not include those instructions here. -The instructions for Openshift are different, so skip the section titled proxy_config_openshift if you are not running Openshift. +The instructions for Openshift are different, so skip the **HTTP Proxy Configuration for Openshift** section if you are not running Openshift. ## HTTP Proxy Configuration for Openshift diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/SKILL.md index fa033b90e..13eb70672 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/SKILL.md @@ -20,6 +20,12 @@ tags: # NVIDIA AI Enterprise +## Prerequisites + +- A running Kubernetes cluster with NVIDIA GPU worker nodes. +- The `kubectl` and `helm` CLIs available on a client machine. +- An NVIDIA AI Enterprise subscription with access to the NVIDIA Enterprise Catalog (NGC) and an NGC API key for the private registry. 
+
 ## About NVIDIA AI Enterprise and Supported Platforms

 NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software, optimized, certified, and supported by NVIDIA with NVIDIA-Certified Systems.
@@ -134,11 +140,31 @@ To identify the correct driver branch:

    For example, NVIDIA AI Enterprise Infra 7.x uses the R580 driver branch.

-1. Refer to operator-component-matrix to identify the recommended GPU Operator version and driver version that uses the same driver branch.
+1. Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) to identify the recommended GPU Operator version and driver version that uses the same driver branch.

-After identifying the correct driver version, refer to install-gpu-operator for installation instructions.
+After identifying the correct driver version, use the `gpu-operator-install` skill for installation instructions.
 Use the `--version=` argument when installing with Helm.

+## Verification
+
+Confirm that the Operator installed with the NVIDIA AI Enterprise components and that licensing succeeded:
+
+1. Confirm the Operator pods are running:
+
+   ```console
+   $ kubectl get pods -n gpu-operator
+   ```
+
+   The driver pods should report `Running` and the `nvidia-operator-validator` pod should report `Completed`.
+
+1. Confirm the driver acquired a valid license:
+
+   ```console
+   $ kubectl exec -it -n gpu-operator -- nvidia-smi -q | grep -i "License Status"
+   ```
+
+   The license status should report `Licensed`.
+
 ## Related Information

 - [NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/) web page.
diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md
index 84804010e..24501c427 100644
--- a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md
+++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md
@@ -18,11 +18,26 @@ tags:



-# Prerequisites
+# Using NVIDIA vGPU
+
+## Prerequisites

 Before installing the GPU Operator on NVIDIA vGPU, ensure the following:

-# Using NVIDIA vGPU
+- The NVIDIA vGPU Host Driver version 12.0 (or later) is pre-installed on all hypervisors hosting NVIDIA vGPU accelerated Kubernetes worker node virtual machines. Refer to the [NVIDIA Virtual GPU Software Documentation](https://docs.nvidia.com/grid/) for details.
+- You must have access to the NVIDIA Enterprise Application Hub at https://nvid.nvidia.com/dashboard/ and the NVIDIA Licensing Portal.
+- Your organization must have an instance of a Cloud License Service (CLS) or a Delegated License Service (DLS).
+- You must generate and download a client configuration token for your CLS instance or DLS instance. Refer to the [NVIDIA License System Quick Start Guide](https://docs.nvidia.com/license-system/latest/nvidia-license-system-quick-start-guide/) for information about generating a token.
+
+  > [!NOTE]
+  > For vGPU 18.0 and later, ensure that you use DLS 3.4 or later.
+
+- You have access to a private registry such as NVIDIA NGC Private Registry and can push container images to the registry.
+- Git and Docker are required to build the vGPU driver image from the source repository and push it to the private registry.
+- Each Kubernetes worker node in the cluster has access to the private registry. Private registry access is usually managed through image pull secrets. You specify the secrets to the NVIDIA GPU Operator when you install the Operator with Helm.
+
+  > [!NOTE]
+  > Uploading the NVIDIA vGPU driver to a publicly available repository or otherwise publicly sharing the driver is a violation of the NVIDIA vGPU EULA.

 ## About Installing the Operator and NVIDIA vGPU

@@ -210,6 +225,14 @@ Perform the following steps to build and push a container image that includes th
    The preceding command installs the Operator with the default configuration.
    Refer to the [GPU Operator Helm chart options](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options) for information about configuration options.

-## Related Skills
+## Verification
+
+Confirm that the Operator installed and the vGPU driver pods are running:
+
+1. Confirm the Operator pods, including the vGPU driver, are running:
+
+   ```console
+   $ kubectl get pods -n gpu-operator
+   ```

-- verify gpu operator install
+   The `nvidia-vgpu-driver-daemonset` pods should report `Running` and the `nvidia-operator-validator` pod should report `Completed`. For general post-install validation, use the `gpu-operator-install` skill's verification steps.
diff --git a/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md
index 3beb6d1a5..960a9298f 100644
--- a/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md
+++ b/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md
@@ -20,6 +20,13 @@ tags:

 # Deploy with Kata Containers

+## Prerequisites
+
+- A running Kubernetes cluster with NVIDIA GPU worker nodes, and the `kubectl` and `helm` CLIs available.
+- Hosts configured to enable hardware virtualization and Access Control Services (ACS) in the BIOS. With some AMD CPUs and BIOSes, ACS might be grouped under Advanced Error Reporting (AER).
+- Hosts configured to support IOMMU.
Check with `ls /sys/kernel/iommu_groups`; if the host is not configured, add the `intel_iommu=on` (or `amd_iommu=on` for AMD CPUs) kernel command-line argument.
+- For Kubernetes versions older than v1.34, the `KubeletPodResourcesGet` feature gate must be explicitly enabled.
+
 ## About the Operator with Kata Containers

 [Kata Containers](https://katacontainers.io/) is an open source project that creates lightweight Virtual Machines (VMs) that feel and perform like traditional containers such as a Docker container.
diff --git a/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md
index 721f168b2..511bc8d64 100644
--- a/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md
+++ b/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md
@@ -18,11 +18,22 @@ tags:



-# Prerequisites
+# GPU Operator with KubeVirt
+
+## Prerequisites

 Before using KubeVirt with the GPU Operator, ensure the following prerequisites are configured on your cluster and nodes:

-# GPU Operator with KubeVirt
+- The virtualization and IOMMU extensions (Intel VT-d or AMD IOMMU) are enabled in the BIOS.
+- The host is booted with `intel_iommu=on` or `amd_iommu=on` on the kernel command line.
+- If planning to use NVIDIA vGPU, SR-IOV must be enabled in the BIOS if your GPUs are based on the NVIDIA Ampere architecture or later. Refer to the [NVIDIA vGPU Documentation](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#prereqs-vgpu) to ensure you have met all the prerequisites for using NVIDIA vGPU.
+- KubeVirt is installed in the cluster.
+- Starting with KubeVirt v0.58.2 and v0.59.1, set the `DisableMDEVConfiguration` feature gate:
+
+  ```console
+  $ kubectl patch kubevirt -n kubevirt kubevirt --type='json' \
+      -p='[{"op": "add", "path": "/spec/configuration/developerConfiguration/featureGates/-", "value": "DisableMDEVConfiguration" }]'
+  ```

 ## About the Operator with KubeVirt

diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md
index 43f70049e..836fc55ce 100644
--- a/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md
+++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md
@@ -21,6 +21,12 @@ tags:

 # NVIDIA GPU Operator with Amazon EKS

+## Prerequisites
+
+- An AWS account, plus the AWS CLI and `eksctl` installed and configured (see the per-example prerequisites below for details).
+- The `kubectl` and `helm` CLIs available on a client machine.
+- An Amazon EKS cluster, or the ability to create one, with a GPU-enabled node group that uses an AMI with an operating system that the GPU Operator supports.
+
 ## Approaches for Working with Amazon EKS

 You can approach running workloads in Amazon EKS with NVIDIA GPUs in at least two ways.
@@ -193,6 +199,26 @@ The steps create a self-managed node group that uses an Amazon EKS optimized AMI
    demo-cluster us-west-2 True
    ```

+## Verification
+
+After the node group is created and the GPU Operator is installed on the cluster (use the `gpu-operator-install` skill), confirm that the GPU nodes are managed:
+
+1. Confirm the GPU nodes advertise GPU capacity:
+
+   ```console
+   $ kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"'
+   ```
+
+   Each GPU node should report a non-null GPU count.
+
+1. Confirm the GPU Operator pods are running:
+
+   ```console
+   $ kubectl get pods -n gpu-operator
+   ```
+
+   The `nvidia-operator-validator` pod should report `Completed`.
+ ## Related Information * The preceding procedure is derived from diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md index ecc992c75..38e83fd40 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md @@ -19,9 +19,6 @@ tags: -# Prerequisites - -> [!TIP] # NVIDIA DRA Driver for GPUs Dynamic Resource Allocation (DRA) is a Kubernetes concept for flexibly requesting, configuring, and sharing specialized devices like GPUs. @@ -33,6 +30,17 @@ Before using the DRA Driver for GPUs, it is recommended that you are familiar wi * [Upstream Kubernetes DRA documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/). * [DRA Driver repository documentation](https://github.com/NVIDIA/k8s-dra-driver-gpu) +## Prerequisites + +> [!TIP] +> You can use the NVIDIA DRA Driver for GPUs ComputeDomain and GPU allocation independently or together in the same cluster. They have different prerequisites; to use both features together, configure your cluster to meet the prerequisites for both. + +For GPU allocation with the GPU Operator: + +- Kubernetes v1.34.2 or newer. If you plan to use traditional extended resource requests such as `nvidia.com/gpu` with the DRA driver, enable the [`DRAExtendedResource`](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#extended-resource) feature gate. +- GPU Operator v25.10.0 or later with the NVIDIA Kubernetes Device Plugin disabled to avoid conflicts with the DRA Driver for GPUs. The DRA Driver requires Container Device Interface (CDI) enabled in the container runtime and NVIDIA Driver version 580 or later, both of which are default in GPU Operator v25.10.0 and later. 
+- Label the nodes you plan to use for GPU allocation (for example, `nvidia.com/dra-kubelet-plugin=true`) and use them as node selectors in the DRA driver Helm chart. + ## Overview With NVIDIA's DRA Driver for GPUs, your Kubernetes workload can allocate and consume the following two types of resources: diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md index 0902207a4..98f3aebf1 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md @@ -18,13 +18,14 @@ tags: -# Prerequisites - -* You installed and initialized the Google Cloud CLI. +# NVIDIA GPU Operator with Google GKE -- name: RUNTIME_CONFIG_SOURCE +## Prerequisites -# NVIDIA GPU Operator with Google GKE +- You installed and initialized the Google Cloud CLI. Refer to [gcloud CLI overview](https://cloud.google.com/sdk/gcloud) in the Google Cloud documentation. +- You have a Google Cloud project to use for your GKE cluster. Refer to [Creating and managing projects](https://cloud.google.com/resource-manager/docs/creating-managing-projects) in the Google Cloud documentation. +- You have the project ID for your Google Cloud project. Refer to [Identifying projects](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects) in the Google Cloud documentation. +- You know the machine type for the node pool and that the machine type is supported in your region and zone. Refer to [GPU platforms](https://cloud.google.com/compute/docs/gpus) in the Google Cloud documentation. ## About Using the Operator with Google GKE @@ -34,9 +35,11 @@ or you can use the Operator and driver manager to manage the driver and other NV The choice depends on the operating system and whether you prefer to have the Operator manage all the software components. 
-| Google Driver Installer - | Container-Optimized OS | Ubuntu with containerd | The Google driver installer manages the NVIDIA GPU Driver. NVIDIA GPU Operator manages other software components. | -| --- | --- | --- | --- | -| NVIDIA Driver Manager - | Ubuntu with containerd | NVIDIA GPU Operator manages the lifecycle and upgrades of the driver and other NVIDIA software. | | +| Approach | Supported OS | Summary | +| --- | --- | --- | +| Google Driver Installer | Container-Optimized OS, Ubuntu with containerd | The Google driver installer manages the NVIDIA GPU Driver. NVIDIA GPU Operator manages other software components. | +| NVIDIA Driver Manager | Ubuntu with containerd | NVIDIA GPU Operator manages the lifecycle and upgrades of the driver and other NVIDIA software. | + The preceding information relates to using GKE Standard node pools. For Autopilot Pods, using the GPU Operator is not supported, and you can refer to [Deploy GPU workloads in Autopilot](https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-gpus). @@ -243,8 +246,25 @@ The steps create the cluster with a node pool that uses a Ubuntu and containerd gpu-operator-quota 38s pods: 0/100 ``` -1. Install the Operator. - Refer to install the NVIDIA GPU Operator. +1. Install the Operator (use the `gpu-operator-install` skill). + +## Verification + +After installing the Operator, confirm that the GPU nodes are managed and operands are healthy: + +1. Confirm the GPU nodes advertise GPU capacity: + + ```console + $ kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"' + ``` + +1. Confirm the GPU Operator pods are running: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + The `nvidia-operator-validator` pod should report `Completed`. 
 ## Related Information

diff --git a/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md
index 0a8bd6a99..e60d18386 100644
--- a/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md
+++ b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md
@@ -20,6 +20,12 @@ tags:

 # Precompiled Driver Containers

+## Prerequisites
+
+- A running Kubernetes cluster with NVIDIA GPU worker nodes.
+- The `kubectl` and `helm` CLIs available on a client machine.
+- A supported operating system for which NVIDIA publishes precompiled driver containers. Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) for supported operating systems.
+
 ## About Precompiled Driver Containers

 Containers with precompiled drivers do not require internet access to download Linux kernel
diff --git a/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md
index 27c0f7b25..57e21c941 100644
--- a/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md
+++ b/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md
@@ -18,11 +18,16 @@ tags:



-# Prerequisites
+# Upgrading the NVIDIA GPU Operator

-- If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods,
+## Prerequisites

-# Upgrading the NVIDIA GPU Operator
+- A Kubernetes cluster with an existing NVIDIA GPU Operator installation and the `kubectl` and `helm` CLIs available.
+- If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods, label the namespace for the Operator to set the enforcement policy to privileged:
+
+  ```console
+  $ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
+  ```

 ## Using Helm

@@ -183,8 +188,20 @@ After the edits are complete, Kubernetes will automatically apply the updates to
 ## Additional Controls for Driver Upgrades

 While most of the GPU Operator managed daemonsets can be upgraded seamlessly, the NVIDIA driver daemonset has special considerations.
-Refer to GPU Driver Upgrades for more information.
+Refer to the GPU driver upgrade behavior (use the `gpu-operator-driver-upgrades` skill) for more information.

 ## Using Operator Lifecycle Manager (OLM) in OpenShift

 For upgrading the GPU Operator when running in OpenShift, refer to the official OpenShift documentation on [upgrading installed operators](https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html/operators/administrator-tasks#olm-upgrading-operators).
+
+## Verification
+
+After upgrading, confirm that the Operator and its operands are healthy:
+
+1. Confirm all GPU Operator pods are running or completed:
+
+   ```console
+   $ kubectl get pods -n gpu-operator
+   ```
+
+   The `nvidia-operator-validator` pod should report `Completed`, and the driver, toolkit, and device-plugin pods should report `Running` on the expected GPU nodes.
From 5e18605eed741bf6a7e8e2c320689bfcc31847ae Mon Sep 17 00:00:00 2001
From: Andrew Chen
Date: Mon, 1 Jun 2026 04:12:41 -0700
Subject: [PATCH 07/13] fix: replace hardcoded v26.3.1 with documented placeholder
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A prior optimization pass replaced the source ${version} Sphinx substitution with a frozen patch version (v26.3.1) across the skills, which freezes a specific release and goes stale.
The original #401 finding was a ${version} leak (raw template var in rendered output), so the fix is a non-leaking, non-frozen value. Replace every command/URL/image-tag occurrence of v26.3.1 introduced on this branch with the angle-bracket placeholder (matching the project's existing / convention), and add a brief inline note on first use per file pointing to the GPU Operator releases page. The version-specific factual reference data (release-notes.md changelog heading, life-cycle-policy.md version table) is left intact — that content genuinely describes a specific historical release and was not introduced by the optimization pass. Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../gpu-operator-custom-driver/SKILL.md | 7 ++++-- .../gpu-operator-gpudirect-rdma/SKILL.md | 9 +++++--- .../SKILL.md | 17 ++++++++------ .../gpu-operator-install-http-proxy/SKILL.md | 9 +++++--- .../SKILL.md | 5 ++++- .../skills/gpu-operator-install/SKILL.md | 22 +++++++++---------- .../gpu-operator-kata-containers/SKILL.md | 7 ++++-- .../skills/gpu-operator-kubevirt/SKILL.md | 7 ++++-- .../gpu-operator-multiinstance/SKILL.md | 11 ++++++---- .../skills/gpu-operator-nvidia-azure/SKILL.md | 5 ++++- .../skills/gpu-operator-nvidia-dra/SKILL.md | 7 ++++-- .../gpu-operator-nvidia-driver/SKILL.md | 5 ++++- .../gpu-operator-nvidia-google/SKILL.md | 5 ++++- .../gpu-operator-precompiled-drivers/SKILL.md | 5 ++++- .../gpu-operator-timeslicing-gpus/SKILL.md | 5 ++++- .../gpu-operator-upgrading-nvidia/SKILL.md | 7 ++++-- 16 files changed, 89 insertions(+), 44 deletions(-) diff --git a/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md index 76f64e470..c345bc6d3 100644 --- a/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md @@ -53,6 +53,9 @@ To pass custom parameters, execute the following steps. 
$ kubectl create configmap kernel-module-params -n gpu-operator --from-file=nvidia.conf=./nvidia.conf ``` +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + 1. Install the GPU Operator and set `driver.kernelModuleConfig.name` to the name of the `ConfigMap` containing the kernel module parameters. @@ -60,7 +63,7 @@ To pass custom parameters, execute the following steps. $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set driver.kernelModuleConfig.name="kernel-module-params" ``` @@ -90,7 +93,7 @@ Refer to [Simplifying GPU Application Development with Heterogeneous Memory Mana $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set driver.kernelModuleConfig.name="kernel-module-params" ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md index d93f82738..75aa8e29f 100644 --- a/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md @@ -104,13 +104,16 @@ For information about the supported versions, refer to Support for GPUDirect RDM ### Installing the GPU Operator and Enabling GPUDirect RDMA +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). 
+ To use DMA-BUF and network device drivers that are installed by the Network Operator: ```console $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ ``` To use DMA-BUF and network device drivers that are installed on the host: @@ -119,7 +122,7 @@ To use DMA-BUF and network device drivers that are installed on the host: $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set driver.rdma.useHostMofed=true ``` @@ -459,7 +462,7 @@ The following sample command applies to clusters that use the Network Operator t $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set gds.enabled=true ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md index 0257f436b..0080b381b 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md @@ -101,19 +101,22 @@ The general syntax for the container image is `/:`. If the version is not specified, you can retrieve the information from the NVIDIA NGC catalog at https://catalog.ngc.nvidia.com/containers. Search for an image, such as `gpu-operator` and then check the available tags for the image. +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). 
+ An example is shown below with the Operator container image: ```yaml operator: repository: nvcr.io/nvidia image: gpu-operator - version: "v26.3.1" + version: "" ``` -For instance, to pull the gpu-operator image version v26.3.1, use the following instruction: +For instance, to pull the gpu-operator image version , use the following instruction: ```console -$ docker pull nvcr.io/nvidia/gpu-operator:v26.3.1 +$ docker pull nvcr.io/nvidia/gpu-operator: ``` There is one caveat with regards to the driver image. The version field must be appended by the OS name running on the worker node. @@ -136,14 +139,14 @@ To push the images to the local registry, simply tag the pulled images by prefix Using the above examples, this will result in: ```console -$ docker tag nvcr.io/nvidia/gpu-operator:v26.3.1 //gpu-operator:v26.3.1 +$ docker tag nvcr.io/nvidia/gpu-operator: //gpu-operator: $ docker tag nvcr.io/nvidia/driver:${recommended}-ubuntu20.04 //driver:${recommended}-ubuntu20.04 ``` Finally, push the images to the local registry: ```console -$ docker push //gpu-operator:v26.3.1 +$ docker push //gpu-operator: $ docker push //driver:${recommended}-ubuntu20.04 ``` @@ -353,7 +356,7 @@ Download and deploy GPU Operator Helm Chart with the updated `values.yaml`. 
Fetch the chart from the NGC repository: ```console -$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-v26.3.1.tgz +$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-.tgz ``` Install the GPU Operator with the customized `values.yaml`: @@ -361,7 +364,7 @@ Install the GPU Operator with the customized `values.yaml`: ```console $ helm install --wait gpu-operator \ -n gpu-operator --create-namespace \ - gpu-operator-v26.3.1.tgz \ + gpu-operator-.tgz \ -f values.yaml ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md index 2b4ee2ce5..862bbe1d8 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md @@ -53,10 +53,13 @@ based on information present in the cluster-wide Proxy object. ## HTTP Proxy Configuration +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + First, get the `values.yaml` file used for GPU Operator configuration: ```console -$ curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator/v26.3.1/deployments/gpu-operator/values.yaml +$ curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator//deployments/gpu-operator/values.yaml ``` Specify `driver.env` in `values.yaml` with appropriate HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables @@ -90,7 +93,7 @@ Download and deploy GPU Operator Helm Chart with the updated `values.yaml`. 
Fetch the chart from the NGC repository: ```console -$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-v26.3.1.tgz +$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-.tgz ``` Install the GPU Operator with updated `values.yaml`: @@ -98,7 +101,7 @@ Install the GPU Operator with updated `values.yaml`: ```console $ helm install --wait gpu-operator \ -n gpu-operator --create-namespace \ - gpu-operator-v26.3.1.tgz \ + gpu-operator-.tgz \ -f values.yaml ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md index aa0c8ade9..76d96ac2a 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md @@ -102,13 +102,16 @@ driver: destinationDir: /etc/yum.repos.d ``` +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + Deploy GPU Operator with updated `values.yaml`: ```console $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ -f values.yaml ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install/SKILL.md index 785d6ab6e..7c0ed9107 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install/SKILL.md @@ -22,7 +22,7 @@ tags: # Installing the NVIDIA GPU Operator -The current patch release of this version of the NVIDIA GPU Operator is `v26.3.1`. +Throughout this skill, replace `` with your target GPU Operator release (for example, the latest patch release listed on the [GPU Operator releases page](https://github.com/NVIDIA/gpu-operator/releases)). 
> [!TIP] > For installation on Red Hat OpenShift Container Platform, refer to [OpenShift installation steps](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html). @@ -84,7 +84,7 @@ The current patch release of this version of the NVIDIA GPU Operator is `v26.3.1 $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 + --version= ``` - Install the Operator and specify configuration options: @@ -93,7 +93,7 @@ The current patch release of this version of the NVIDIA GPU Operator is `v26.3.1 $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set = ``` @@ -159,7 +159,7 @@ For example, to install the GPU Operator in the `nvidia-gpu-operator` namespace: $ helm install --wait --generate-name \ -n nvidia-gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ ``` If you do not specify a namespace during installation, all GPU Operator components are installed in the `default` namespace. 
@@ -193,7 +193,7 @@ In this scenario, use the NVIDIA Container Toolkit image that is built on UBI 8: $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set toolkit.version=v1.16.1-ubi8 ``` @@ -213,7 +213,7 @@ In this scenario, the NVIDIA GPU driver is already installed on the worker nodes $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set driver.enabled=false ``` @@ -238,7 +238,7 @@ Install the Operator with the following options: $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set driver.enabled=false \ --set toolkit.enabled=false ``` @@ -259,7 +259,7 @@ In this scenario, the NVIDIA Container Toolkit is already installed on the worke $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set toolkit.enabled=false ``` @@ -287,7 +287,7 @@ you can build a custom driver container image. Follow these steps: $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set driver.repository=docker.io/nvidia \ --set driver.version="465.27" ``` @@ -321,7 +321,7 @@ If you need to specify custom values, refer to the following sample command for ```console helm install gpu-operator -n gpu-operator --create-namespace \ nvidia/gpu-operator $HELM_OPTIONS \ - --version=v26.3.1 \ + --version= \ --set toolkit.env[0].name=CONTAINERD_CONFIG \ --set toolkit.env[0].value=/etc/containerd/containerd.toml \ --set toolkit.env[1].name=CONTAINERD_SOCKET \ @@ -391,7 +391,7 @@ These options can be passed to GPU Operator during install time as below. 
```console helm install gpu-operator -n gpu-operator --create-namespace \ nvidia/gpu-operator $HELM_OPTIONS \ - --version=v26.3.1 \ + --version= \ --set toolkit.env[0].name=CONTAINERD_CONFIG \ --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \ --set toolkit.env[1].name=CONTAINERD_SOCKET \ diff --git a/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md index 960a9298f..42bbf8e41 100644 --- a/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md @@ -342,6 +342,9 @@ Install the NVIDIA GPU Operator and configure it to deploy Kata Container compon Update Complete. ⎈Happy Helming!⎈ ``` +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + 1. Install the GPU Operator. The following configures the GPU Operator to deploy the operands that are required for Kata Containers. Refer to Common Chart Customization Options for more details on the additional configuration options you can specify when installing the GPU Operator. 
@@ -350,7 +353,7 @@ Install the NVIDIA GPU Operator and configure it to deploy Kata Container compon $ helm install --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set sandboxWorkloads.enabled=true \ --set sandboxWorkloads.mode=kata \ --set nfd.enabled=true \ @@ -446,7 +449,7 @@ The following example installs the GPU Operator with both `P_GPU_ALIAS` and `NVS $ helm install --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set sandboxWorkloads.enabled=true \ --set sandboxWorkloads.mode=kata \ --set nfd.enabled=true \ diff --git a/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md index 511bc8d64..d20aff45a 100644 --- a/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md @@ -140,13 +140,16 @@ The term *sandboxing* refers to running software in a separate isolated environm We use the term `sandbox workloads` to signify workloads that run in a virtual machine, irrespective of the virtualization technology used. #### Install the GPU Operator without NVIDIA vGPU +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + Install the GPU Operator, enabling `sandboxWorkloads`: ```console $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set sandboxWorkloads.enabled=true ``` @@ -176,7 +179,7 @@ Follow the steps provided in this section. 
$ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set sandboxWorkloads.enabled=true \ --set vgpuManager.enabled=true \ --set vgpuManager.repository= \ diff --git a/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md index e2c048e9f..0ebff5f33 100644 --- a/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md @@ -40,13 +40,16 @@ Refer to the [Multi-Instance GPU architecture](https://docs.nvidia.com/datacente Use the following steps to enable MIG and deploy MIG Manager. +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + 1. Install the Operator: ```console $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set mig.strategy=single ``` @@ -316,7 +319,7 @@ In your values.yaml file, set `migManager.config.create` to `true`, set `migMana ```console $ helm upgrade --install gpu-operator -n gpu-operator --create-namespace \ - nvidia/gpu-operator --version=v26.3.1 \ + nvidia/gpu-operator --version= \ -f values.yaml ``` @@ -449,7 +452,7 @@ can be used to install the GPU Operator: $ helm install gpu-operator \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set driver.enabled=false ``` @@ -498,7 +501,7 @@ Alternatively, you can create a custom ConfigMap for use by MIG Manager by perfo $ helm install gpu-operator \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set migManager.gpuClientsConfig.name=gpu-clients \ --set driver.enabled=false ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md 
b/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md index a2b0781c8..a2f4d33a7 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md @@ -95,12 +95,15 @@ deploying NVIDIA Driver Containers and the NVIDIA Container Toolkit. && helm repo update ``` +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + 1. Install the Operator without the driver containers and toolkit: ```console $ helm install gpu-operator nvidia/gpu-operator \ -n gpu-operator --create-namespace \ - --version=v26.3.1 \ + --version= \ --set driver.enabled=false \ --set toolkit.enabled=false \ --set operator.runtimeClass=nvidia-container-runtime diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md index 38e83fd40..5eb640ae9 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md @@ -75,11 +75,14 @@ You can use the NVIDIA DRA Driver for GPUs with the NVIDIA GPU Operator to deplo && helm repo update ``` +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + 3. 
Install the GPU Operator with the NVIDIA Kubernetes Device Plugin disabled: ```console helm upgrade --install gpu-operator nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --create-namespace \ --namespace gpu-operator \ --set devicePlugin.enabled=false \ @@ -101,7 +104,7 @@ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ ```console helm upgrade --install gpu-operator nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --create-namespace \ --namespace gpu-operator ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md index fd0738fc6..e4a1c0f59 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md @@ -171,13 +171,16 @@ Perform the following steps to install the GPU Operator and use the NVIDIA drive && helm repo update ``` +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + - Install the Operator and specify at least the `--set driver.nvidiaDriverCRD.enabled=true` argument: ```console $ helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set driver.nvidiaDriverCRD.enabled=true ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md index 98f3aebf1..c081483ec 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md @@ -132,13 +132,16 @@ You can create a node pool that uses a Container-Optimized OS node image or a Ub [Manually install NVIDIA GPU drivers](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) in the GKE documentation. 
+> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + 1. Install the Operator using Helm: ```console $ helm install --wait --generate-name \ -n gpu-operator \ nvidia/gpu-operator \ - --version=v26.3.1 \ + --version= \ --set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \ --set toolkit.installDir=/home/kubernetes/bin/nvidia \ --set cdi.enabled=true \ diff --git a/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md index e60d18386..6f7fbeae3 100644 --- a/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md @@ -99,6 +99,9 @@ Use one of the following ways to check if a driver container is available for yo ## Enabling Precompiled Driver Container Support During Installation +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + Refer to the common instructions for installing the Operator with Helm at install-gpu-operator. 
Specify the `--set driver.usePrecompiled=true` and `--set driver.version=` arguments like the following example command: @@ -106,7 +109,7 @@ Specify the `--set driver.usePrecompiled=true` and `--set driver.version= \ --set driver.usePrecompiled=true \ --set driver.version="" ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md index c85222830..e7725c752 100644 --- a/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md @@ -322,12 +322,15 @@ Perform the following steps to configure time-slicing before installing the oper $ kubectl create -f time-slicing-config.yaml ``` +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + 1. Install the operator with Helm: ```console $ helm install gpu-operator nvidia/gpu-operator \ -n gpu-operator \ - --version=v26.3.1 \ + --version= \ --set devicePlugin.config.name=time-slicing-config ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md index 57e21c941..19425afb6 100644 --- a/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md @@ -51,10 +51,13 @@ you can upgrade the GPU Operator chart manually or by enabling a Helm hook. With this procedure, all existing GPU Operator resources are updated inline and the cluster policy resource is patched with updates from `values.yaml`. +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + 1. Specify the Operator release tag in an environment variable: ```console - $ export RELEASE_TAG=v26.3.1 + $ export RELEASE_TAG= ``` 1. 
Apply the custom resource definitions for the cluster policy and NVIDIA driver: @@ -136,7 +139,7 @@ Starting with GPU Operator v24.9.0, the upgrade CRD Helm hook is enabled by defa 1. Specify the Operator release tag in an environment variable: ```console - $ export RELEASE_TAG=v26.3.1 + $ export RELEASE_TAG= ``` 1. Update the information about the Operator chart: From 12716644ee8cbc5158109e45bb0787e68a8e5c80 Mon Sep 17 00:00:00 2001 From: Andrew Chen Date: Mon, 1 Jun 2026 14:44:18 -0700 Subject: [PATCH 08/13] refactor(skills): restructure 7 procedural GPU Operator skills to information-hiding dispatch layout Convert the 7 largest procedural skills (450-583 lines each) into a thin dispatch-layer SKILL.md (each <200 lines, all under ~55 non-frontmatter lines) plus phase-specific references/*.md files. Step-by-step command sequences, manifests, field tables, and verification output move out of SKILL.md into references/, so the dispatch layer cannot leak later-phase procedural detail (structural no-skip property; no tooling dependency). All verified-fix content from the prior optimization pass (prerequisites, verification sections, fixed cross-refs, placeholders) is preserved and relocated, not removed. Skills restructured (before -> after SKILL.md raw line count): - gpu-operator-kata-containers 583 -> 65 - gpu-operator-install 572 -> 71 - gpu-operator-multiinstance 560 -> 62 - gpu-operator-gpudirect-rdma 555 -> 67 - gpu-operator-kubevirt 515 -> 68 - gpu-operator-timeslicing-gpus 492 -> 62 - gpu-operator-nvidia-driver 451 -> 60 Each passes the close-your-eyes dispatch-layer test (leaked cmds, bash blocks, and line count all within thresholds). Internal SKILL.md -> references/*.md links verified to resolve. The remaining 17 procedural skills plus the reference skill are deferred to a later batch. 
Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../gpu-operator-gpudirect-rdma/SKILL.md | 544 +---------------- .../references/concepts.md | 75 +++ .../references/rdma.md | 356 +++++++++++ .../references/storage.md | 107 ++++ .../skills/gpu-operator-install/SKILL.md | 565 +---------------- .../references/chart-options.md | 47 ++ .../references/containerd-config.md | 116 ++++ .../references/deployment-scenarios.md | 156 +++++ .../references/install.md | 39 ++ .../references/prerequisites.md | 42 ++ .../references/verification.md | 165 +++++ .../gpu-operator-kata-containers/SKILL.md | 574 +----------------- .../references/concepts.md | 78 +++ .../references/install.md | 291 +++++++++ .../references/prerequisites.md | 87 +++ .../references/workload.md | 114 ++++ .../skills/gpu-operator-kubevirt/SKILL.md | 507 +--------------- .../references/build-vgpu-manager.md | 60 ++ .../references/concepts.md | 81 +++ .../references/configure-and-install.md | 259 ++++++++ .../references/vgpu-device-config.md | 99 +++ .../gpu-operator-multiinstance/SKILL.md | 552 +---------------- .../references/concepts-and-install.md | 139 +++++ .../references/examples.md | 342 +++++++++++ .../references/preinstalled-drivers.md | 72 +++ .../gpu-operator-nvidia-driver/SKILL.md | 439 +------------- .../references/concepts.md | 125 ++++ .../references/install.md | 47 ++ .../references/manifests.md | 210 +++++++ .../references/upgrade-and-verify.md | 56 ++ .../gpu-operator-timeslicing-gpus/SKILL.md | 484 +-------------- .../references/concepts.md | 113 ++++ .../references/configuration.md | 222 +++++++ .../references/verification.md | 139 +++++ 34 files changed, 3833 insertions(+), 3469 deletions(-) create mode 100644 gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/references/concepts.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/references/rdma.md create mode 100644 
gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/references/storage.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install/references/chart-options.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install/references/containerd-config.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install/references/deployment-scenarios.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install/references/install.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install/references/prerequisites.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install/references/verification.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-kata-containers/references/concepts.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-kata-containers/references/install.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-kata-containers/references/prerequisites.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-kata-containers/references/workload.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-kubevirt/references/build-vgpu-manager.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-kubevirt/references/concepts.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-kubevirt/references/configure-and-install.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-kubevirt/references/vgpu-device-config.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-multiinstance/references/concepts-and-install.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-multiinstance/references/examples.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-multiinstance/references/preinstalled-drivers.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/concepts.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/install.md create mode 100644 
gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/manifests.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/upgrade-and-verify.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/references/concepts.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/references/configuration.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/references/verification.md diff --git a/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md index 75aa8e29f..d1b7dbcc5 100644 --- a/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/SKILL.md @@ -22,6 +22,10 @@ tags: # GPUDirect RDMA and GPUDirect Storage +Enable direct data paths between NVIDIA GPUs and peer devices (RDMA-capable NICs +for GPUDirect RDMA, or storage for GPUDirect Storage) using the GPU Operator +together with the NVIDIA Network Operator. + ## Prerequisites - A running Kubernetes cluster with NVIDIA GPU worker nodes. @@ -29,527 +33,35 @@ tags: - NVIDIA Network Operator installed for RDMA-capable networking, and compatible RDMA-capable NICs on the GPU nodes. - A supported NVIDIA Open GPU Kernel module driver, which is required for GPUDirect Storage. -## About GPUDirect RDMA and GPUDirect Storage - -[GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html) is a technology in NVIDIA GPUs that enables direct -data exchange between GPUs and a third-party peer device using PCI Express. The third-party devices could be network interfaces -such as NVIDIA ConnectX SmartNICs or BlueField DPUs, or video acquisition adapters. 
- -[GPUDirect Storage](https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html) (GDS) enables a direct data path between local or remote storage, such as NFS servers or NVMe/NVMe over Fabric (NVMe-oF), and GPU memory. -GDS performs direct memory access (DMA) transfers between GPU memory and storage. -DMA avoids a bounce buffer through the CPU. -This direct path increases system bandwidth and decreases the latency and utilization load on the CPU. - -To support GPUDirect RDMA, userspace CUDA APIs are required. -The kernel mode support is provided by one of two approaches: DMA-BUF from the Linux kernel or the legacy `nvidia-peermem` kernel module. -NVIDIA recommends using the DMA-BUF rather than using the `nvidia-peermem` kernel module from the GPU Driver. - -The Operator uses GDS driver version 2.17.5 or newer. -This version and higher is only supported with the NVIDIA Open GPU Kernel module driver. -In GPU Operator v25.3.0 and later, the `driver.kernelModuleType` default is `auto`, for the supported driver versions. -This configuration allows the GPU Operator to choose the recommended driver kernel module type depending on the driver branch and the GPU devices available. -Newer driver versions will use the open kernel module by default, however to make sure you are using the open kernel module, include `--set driver.kernelModuleType=open` command-line argument in your helm Operator install command. - -In conjunction with the Network Operator, the GPU Operator can be used to -set up the networking related components such as network device kernel drivers and Kubernetes device plugins to enable -workloads to take advantage of GPUDirect RDMA and GPUDirect Storage. -Refer to the Network Operator [documentation](https://docs.nvidia.com/networking/software/cloud-orchestration/index.html) for installation information. 
- -## Common Prerequisites - -The prerequisites for configuring GPUDirect RDMA or GPUDirect Storage depend on whether you use DMA-BUF from the Linux kernel or the legacy `nvidia-peermem` kernel module. - -| Technology | DMA-BUF | Legacy NVIDIA-peermem | -| --- | --- | --- | -| GPU Driver | An Open Kernel module driver is required. | Any supported driver. | -| CUDA | CUDA 11.7 or higher. The CUDA runtime is provided by the driver. | No minimum version. The CUDA runtime is provided by the driver. | -| GPU | Turing architecture data center, Quadro RTX, and RTX GPU or higher. | All data center, Quadro RTX, and RTX GPU or higher. | -| Network Device Drivers | MLNX_OFED or DOCA-OFED are optional. You can use the Linux driver packages from the package manager. | MLNX_OFED or DOCA-OFED are required. | -| Linux Kernel | 5.12 or higher. | No minimum version. | -* Make sure the network device drivers are installed. - - You can use the [Network Operator](https://docs.nvidia.com/networking/software/cloud-orchestration/index.html) - to manage the driver lifecycle for MLNX_OFED and DOCA-OFED drivers. - - You can install the drivers on each host. - Refer to [Adapter Software](https://docs.nvidia.com/networking/software/adapter-software/index.html) - in the networking documentation for information about the MLNX_OFED, DOCA-OFED, and Linux inbox drivers. - -* For installations on VMware vSphere, refer to the following additional prerequisites: - - * Make sure the network interface controller and the NVIDIA GPU are in the same PCIe IO root complex. - * Enable the following PCI options: - - * `pciPassthru.allowP2P = true` - * `pciPassthru.RelaxACSforP2P = true` - * `pciPassthru.use64bitMMIO = true` - * `pciPassthru.64bitMMIOSizeGB = 128` - - For information about configuring the settings, refer to the - [Deploy an AI-Ready Enterprise Platform on vSphere 7](https://www.vmware.com/docs/deploy-an-ai-ready-enterprise-platform-on-vsphere-7-update-2#vm-settings-A) - document from VMWare. 
- -## Configuring GPUDirect RDMA - -### Platform Support - -The following platforms are supported for GPUDirect with RDMA: - -* Kubernetes on bare metal and on vSphere VMs with GPU passthrough and vGPU. -* VMware vSphere with Tanzu. -* For Red Hat OpenShift Container Platform on bare metal and on vSphere VMs with GPU passthrough and vGPU configurations, - refer to NVIDIA AI Enterprise with OpenShift. - -For information about the supported versions, refer to Support for GPUDirect RDMA on the platform support page. - -### Installing the GPU Operator and Enabling GPUDirect RDMA - -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). - -To use DMA-BUF and network device drivers that are installed by the Network Operator: - -```console -$ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ -``` - -To use DMA-BUF and network device drivers that are installed on the host: - -```console -$ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set driver.rdma.useHostMofed=true -``` - -To use the legacy `nvidia-peermem` kernel module instead of DMA-BUF, add `--set driver.rdma.enabled=true` to either of the preceding commands. -Add `--set driver.kernelModuleType=open` if you are using a driver version from a branch earlier than R570. - -### Verifying the Installation of GPUDirect with RDMA - -During the installation, the NVIDIA driver daemon set runs an `init container` to wait on the network device kernel drivers to be ready. -This init container checks for Mellanox NICs on the node and ensures that the necessary kernel symbols are exported by the kernel drivers. - -If you were required to use the `driver.rdma.enabled=true` argument when you installed the Operator, the nvidia-peermem-ctr container is started inside each driver pod after the verification. - -1. 
Confirm that the pod template for the driver daemon set includes the mofed-validation init container and - the nvidia-driver-ctr containers: - - ```console - $ kubectl describe ds -n gpu-operator nvidia-driver-daemonset - ``` - - *Example Output* - - The following partial output omits the init containers and containers that are common to all installations. - - ```output - ... - Init Containers: - mofed-validation: - Container ID: containerd://5a36c66b43f676df616e25ba7ae0c81aeaa517308f28ec44e474b2f699218de3 - Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.1 - Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:7a70e95fd19c3425cd4394f4b47bbf2119a70bd22d67d72e485b4d730853262c - ... - Containers: - nvidia-driver-ctr: - Container ID: containerd://199a760946c55c3d7254fa0ebe6a6557dd231179057d4909e26c0e6aec49ab0f - Image: nvcr.io/nvaie/vgpu-guest-driver:470.63.01-ubuntu20.04 - Image ID: nvcr.io/nvaie/vgpu-guest-driver@sha256:a1b7d2c8e1bad9bb72d257ddfc5cec341e790901e7574ba2c32acaddaaa94625 - ... - nvidia-peermem-ctr: - Container ID: containerd://0742d86f6017bf0c304b549ebd8caad58084a4185a1225b2c9a7f5c4a171054d - Image: nvcr.io/nvaie/vgpu-guest-driver:470.63.01-ubuntu20.04 - Image ID: nvcr.io/nvaie/vgpu-guest-driver@sha256:a1b7d2c8e1bad9bb72d257ddfc5cec341e790901e7574ba2c32acaddaaa94625 - ... - ``` - - The nvidia-peermem-ctr container is present only if you were required to specify the `driver.rdma.enabled=true` argument when you installed the Operator. - -1. Legacy only: Confirm that the nvidia-peermem-ctr container successfully loaded the nvidia-peermem kernel module: - - ```console - $ kubectl logs -n gpu-operator ds/nvidia-driver-daemonset -c nvidia-peermem-ctr - ``` - - Alternatively, run `kubectl logs -n gpu-operator nvidia-driver-daemonset-xxxxx -c nvidia-peermem-ctr` for each pod in the daemonset. 
- - *Example Output* - - ```output - waiting for mellanox ofed and nvidia drivers to be installed - waiting for mellanox ofed and nvidia drivers to be installed - successfully loaded nvidia-peermem module - ``` - -### Verifying the Installation by Performing a Data Transfer - -You can perform the following steps to verify that GPUDirect with RDMA is configured -correctly and that pods can perform RDMA data transfers. - -1. Get the network interface name of the InfiniBand device on the host: - - ```console - $ kubectl exec -it -n network-operator mofed-ubuntu22.04-ds-xxxxx -- ibdev2netdev - ``` - - *Example Output* - - ```output - mlx5_0 port 1 ==> ens64np1 (Up) - ``` - -1. Configure a secondary network on the device using a macvlan network attachment: - - - Create a file, such as `demo-macvlannetwork.yaml`, with contents like the following example: - - ```yaml - apiVersion: mellanox.com/v1alpha1 - kind: MacvlanNetwork - metadata: - name: demo-macvlannetwork - spec: - networkNamespace: "default" - master: "ens64np1" - mode: "bridge" - mtu: 1500 - ipam: | - { - "type": "whereabouts", - "range": "192.168.2.225/28", - "exclude": [ - "192.168.2.229/30", - "192.168.2.236/32" - ] - } - ``` - - Replace `ens64np1` with the the network interface name reported by the `ibdev2netdev` command - from the preceding step. - - - Apply the manifest: - - ```console - $ kubectl apply -f demo-macvlannetwork.yaml - ``` - - - Confirm that the additional network is ready: - - ```console - $ kubectl get macvlannetworks demo-macvlannetwork - ``` - - *Example Output* - - ```output - NAME STATUS AGE - demo-macvlannetwork ready 2023-03-10T18:22:28Z - ``` - -1. Start two pods that run the `mellanox/cuda-perftest` container on two different nodes in the cluster. 
- - ### demo-pod-1 - - - Create a file, such as `demo-pod-1.yaml`, for the first pod with contents like the following: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: demo-pod-1 - annotations: - k8s.v1.cni.cncf.io/networks: demo-macvlannetwork - # If a network with static IPAM is used replace network annotation with the below. - # k8s.v1.cni.cncf.io/networks: '[ - # { "name": "rdma-net", - # "ips": ["192.168.111.101/24"], - # "gateway": ["192.168.111.1"] - # } - # ]' - spec: - nodeSelector: - # Note: Replace hostname or remove selector altogether - kubernetes.io/hostname: nvnode1 - restartPolicy: OnFailure - containers: - - image: mellanox/cuda-perftest - name: rdma-gpu-test-ctr - securityContext: - capabilities: - add: [ "IPC_LOCK" ] - resources: - limits: - nvidia.com/gpu: 1 - rdma/rdma_shared_device_a: 1 - requests: - nvidia.com/gpu: 1 - rdma/rdma_shared_device_a: 1 - ``` - - - Apply the manifest: - - ```console - $ kubectl apply -f demo-pod-1.yaml - ``` - - ### demo-pod-2 - - - Create a file, such as `demo-pod-2.yaml`, for the second pod with contents like the following: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: demo-pod-2 - annotations: - k8s.v1.cni.cncf.io/networks: demo-macvlannetwork - # If a network with static IPAM is used replace network annotation with the below. - # k8s.v1.cni.cncf.io/networks: '[ - # { "name": "rdma-net", - # "ips": ["192.168.111.101/24"], - # "gateway": ["192.168.111.1"] - # } - # ]' - spec: - nodeSelector: - # Note: Replace hostname or remove selector altogether - kubernetes.io/hostname: nvnode2 - restartPolicy: OnFailure - containers: - - image: mellanox/cuda-perftest - name: rdma-gpu-test-ctr - securityContext: - capabilities: - add: [ "IPC_LOCK" ] - resources: - limits: - nvidia.com/gpu: 1 - rdma/rdma_shared_device_a: 1 - requests: - nvidia.com/gpu: 1 - rdma/rdma_shared_device_a: 1 - ``` - - - Apply the manifest: - - ```console - $ kubectl apply -f demo-pod-2.yaml - ``` - -1. 
Get the IP addresses of the pods: - - ```console - $ kubectl get pods -o wide - ``` - - *Example Output* - - ```output - NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES - demo-pod-1 1/1 Running 0 3d4h 192.168.38.90 nvnode1 - demo-pod-2 1/1 Running 0 3d4h 192.168.47.89 nvnode2 - ``` - -1. From one terminal, open a shell in the container on the first pod and start the performance test server: - - ```console - $ kubectl exec -it demo-pod-1 -- ib_write_bw --use_cuda=0 --use_cuda_dmabuf \ - -d mlx5_0 -a -F --report_gbits -q 1 - ``` - - *Example Output* - - ```output - ************************************ - * Waiting for client to connect... * - ************************************ - ``` - -1. From another terminal, open a shell in the container on the second pod and run the performance client: - - ```console - $ kubectl exec -it demo-pod-2 -- ib_write_bw -n 5000 --use_cuda=0 --use_cuda_dmabuf \ - -d mlx5_0 -a -F --report_gbits -q 1 192.168.38.90 - ``` - - *Example Output* - - ```output - --------------------------------------------------------------------------------------- - RDMA_Write BW Test - Dual-port : OFF Device : mlx5_0 - Number of qps : 1 Transport type : IB - Connection type : RC Using SRQ : OFF - PCIe relax order: ON - ibv_wr* API : ON - TX depth : 128 - CQ Moderation : 100 - Mtu : 1024[B] - Link type : Ethernet - GID index : 5 - Max inline data : 0[B] - rdma_cm QPs : OFF - Data ex. 
method : Ethernet - --------------------------------------------------------------------------------------- - local address: LID 0000 QPN 0x01ac PSN 0xc76db1 RKey 0x23beb2 VAddr 0x007f26a2c8b000 - GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:02:226 - remote address: LID 0000 QPN 0x01a9 PSN 0x2f722 RKey 0x23beaf VAddr 0x007f820b24f000 - GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:02:225 - --------------------------------------------------------------------------------------- - #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] - 2 5000 0.11 0.11 6.897101 - 4 5000 0.22 0.22 6.995646 - 8 5000 0.45 0.45 7.014752 - 16 5000 0.90 0.90 7.017509 - 32 5000 1.80 1.80 7.020162 - 64 5000 3.59 3.59 7.007110 - 128 5000 7.19 7.18 7.009540 - 256 5000 15.06 14.98 7.313517 - 512 5000 30.04 29.73 7.259329 - 1024 5000 59.65 58.81 7.178529 - 2048 5000 91.53 91.47 5.582931 - 4096 5000 92.13 92.06 2.809574 - 8192 5000 92.35 92.31 1.408535 - 16384 5000 92.46 92.46 0.705381 - 32768 5000 92.36 92.35 0.352302 - 65536 5000 92.39 92.38 0.176196 - 131072 5000 92.42 92.41 0.088131 - 262144 5000 92.45 92.44 0.044080 - 524288 5000 92.42 92.42 0.022034 - 1048576 5000 92.40 92.40 0.011015 - 2097152 5000 92.40 92.39 0.005507 - 4194304 5000 92.40 92.39 0.002753 - 8388608 5000 92.39 92.39 0.001377 - --------------------------------------------------------------------------------------- - ``` - - The command output indicates that the data transfer rate was approximately 92 Gbps. - -1. Delete the pods: - - ```console - $ kubectl delete -f demo-pod-1.yaml -f demo-pod-2.yaml - ``` - -1. Delete the secondary network: - - ```console - $ kubectl delete -f demo-macvlannetworks.yaml - ``` - -## Using GPUDirect Storage - -### Platform Support - -See Support for GPUDirect Storage on the platform support page. 
- -### Installing the GPU Operator and Enabling GPUDirect Storage - -The following section is applicable to the following configurations and describe how to deploy the GPU Operator using the Helm Chart: - -* Kubernetes on bare metal and on vSphere VMs with GPU passthrough and vGPU. - -Starting with v22.9.1, the GPU Operator provides an option to load the `nvidia-fs` kernel module during the bootstrap of the NVIDIA driver daemon set. -Starting with v23.9.1, the GPU Operator deploys a version of GDS that requires using the NVIDIA Open Kernel module driver. - -The following sample command applies to clusters that use the Network Operator to install the network device kernel drivers. - -```console -$ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set gds.enabled=true -``` - -Add `--set driver.rdma.enabled=true` to the command to use the legacy `nvidia-peermem` kernel module. - -Add `--set driver.kernelModuleType=open` if you are using a driver version from a branch earlier than R570. - -### Verification - -During the installation, an init container is used with the driver daemon set to wait on the network device kernel drivers to be ready. -This init container checks for Mellanox NICs on the node and ensures that the necessary kernel symbols are exported by the kernel drivers. -After the verification completes, the nvidia-fs-ctr container starts inside the driver pods. - -If you were required to use the `driver.rdma.enabled=true` argument when you installed the Operator, the nvidia-peermem-ctr container is started inside each driver pod after the verification. 
- -```console -$ kubectl get pod -n gpu-operator -``` - -*Example Output* - -```output -gpu-operator gpu-feature-discovery-pktzg 1/1 Running 0 11m -gpu-operator gpu-operator-1672257888-node-feature-discovery-master-7ccb7txmc 1/1 Running 0 12m -gpu-operator gpu-operator-1672257888-node-feature-discovery-worker-bqhrl 1/1 Running 0 11m -gpu-operator gpu-operator-6f64c86bc-zjqdh 1/1 Running 0 12m -gpu-operator nvidia-container-toolkit-daemonset-rgwqg 1/1 Running 0 11m -gpu-operator nvidia-cuda-validator-8whvt 0/1 Completed 0 8m50s -gpu-operator nvidia-dcgm-exporter-pt9q9 1/1 Running 0 11m -gpu-operator nvidia-device-plugin-daemonset-472fc 1/1 Running 0 11m -gpu-operator nvidia-device-plugin-validator-29nhc 0/1 Completed 0 8m34s -gpu-operator nvidia-driver-daemonset-j9vw6 3/3 Running 0 12m -gpu-operator nvidia-mig-manager-mtjcw 1/1 Running 0 7m35s -gpu-operator nvidia-operator-validator-b8nz2 1/1 Running 0 11m -``` - -```console -$ kubectl describe pod -n gpu-operator nvidia-driver-daemonset-xxxx - - Init Containers: - mofed-validation: - Container ID: containerd://a31a8c16ce7596073fef7cb106da94c452fdff111879e7fc3ec58b9cef83856a - Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1 - Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:18c9ea88ae06d479e6657b8a4126a8ee3f4300a40c16ddc29fb7ab3763d46005 - - - Containers: - nvidia-driver-ctr: - Container ID: containerd://7cf162e4ee4af865c0be2023d61fbbf68c828d396207e7eab2506f9c2a5238a4 - Image: nvcr.io/nvidia/driver:525.60.13-ubuntu20.04 - Image ID: nvcr.io/nvidia/driver@sha256:0ee0c585fa720f177734b3295a073f402d75986c1fe018ae68bd73fe9c21b8d8 - - - nvidia-peermem-ctr: - Container ID: containerd://5c71c9f8ccb719728a0503500abecfb5423e8088f474d686ee34b5fe3746c28e - Image: nvcr.io/nvidia/driver:525.60.13-ubuntu20.04 - Image ID: nvcr.io/nvidia/driver@sha256:0ee0c585fa720f177734b3295a073f402d75986c1fe018ae68bd73fe9c21b8d8 - - - nvidia-fs-ctr: - Container ID: 
containerd://f5c597d59e1cf8747aa20b8c229a6f6edd3ed588b9d24860209ba0cc009c0850 - Image: nvcr.io/nvidia/cloud-native/nvidia-fs:2.14.13-ubuntu20.04 - Image ID: nvcr.io/nvidia/cloud-native/nvidia-fs@sha256:109485365f68caeaee1edee0f3f4d722fe5b5d7071811fc81c630c8a840b847b - - -``` +> The full kernel-mode requirements (DMA-BUF vs legacy `nvidia-peermem`) and the per-technology prerequisite matrix are in [references/concepts.md](references/concepts.md). -Lastly, verify that NVIDIA kernel modules are loaded on the worker node: +## Activation -```console -$ lsmod | grep nvidia +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All Helm/kubectl command sequences, manifests, and expected +verification output live only in those reference files — do not improvise +commands from this dispatch layer. -nvidia_fs 245760 0 -nvidia_peermem 16384 0 -nvidia_modeset 1159168 0 -nvidia_uvm 1048576 0 -nvidia 39059456 115 nvidia_uvm,nvidia_modeset -ib_core 319488 9 rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm -drm 491520 6 drm_kms_helper,drm_vram_helper,nvidia,mgag200,ttm -``` +## Phases -## Related Information +| Phase | Summary | Reference | +|-------|---------|-----------| +| Concepts | What GPUDirect RDMA and GPUDirect Storage are, the DMA-BUF vs legacy `nvidia-peermem` kernel-mode approaches, the per-technology prerequisite matrix, vSphere requirements, and related links. | [references/concepts.md](references/concepts.md) | +| GPUDirect RDMA | Platform support, installing the GPU Operator with RDMA enabled (DMA-BUF or legacy), verifying the driver daemon set, and verifying with an end-to-end `ib_write_bw` data transfer between two pods. | [references/rdma.md](references/rdma.md) | +| GPUDirect Storage | Platform support, installing the GPU Operator with `gds.enabled=true`, and verifying that the `nvidia-fs` module and driver pods are loaded. 
| [references/storage.md](references/storage.md) | -Refer to the following resources for more information: +## Hard rules (apply across all phases) - * GPUDirect RDMA: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html +- NVIDIA recommends DMA-BUF over the legacy `nvidia-peermem` kernel module; only add `--set driver.rdma.enabled=true` when you specifically need the legacy module. +- GPUDirect Storage (GDS 2.17.5+) requires the NVIDIA Open GPU Kernel module driver. +- Add `--set driver.kernelModuleType=open` if you are using a driver version from a branch earlier than R570. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. - * NVIDIA Network Operator: https://github.com/Mellanox/network-operator +## Verification - * Blog post on deploying the Network Operator: https://developer.nvidia.com/blog/deploying-gpudirect-rdma-on-egx-stack-with-the-network-operator/ +For RDMA, run the two-pod `ib_write_bw` data transfer and confirm a high +sustained transfer rate. For GDS, confirm the `nvidia-fs` kernel module is +loaded and the driver pods are `Running`. Exact commands and expected output +are in [references/rdma.md](references/rdma.md) and +[references/storage.md](references/storage.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/references/concepts.md b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/references/concepts.md new file mode 100644 index 000000000..50bed2ae7 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/references/concepts.md @@ -0,0 +1,75 @@ + + + +# GPUDirect RDMA and GPUDirect Storage: Concepts and Common Prerequisites + +## About GPUDirect RDMA and GPUDirect Storage + +[GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html) is a technology in NVIDIA GPUs that enables direct +data exchange between GPUs and a third-party peer device using PCI Express. 
The third-party devices could be network interfaces +such as NVIDIA ConnectX SmartNICs or BlueField DPUs, or video acquisition adapters. + +[GPUDirect Storage](https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html) (GDS) enables a direct data path between local or remote storage, such as NFS servers or NVMe/NVMe over Fabric (NVMe-oF), and GPU memory. +GDS performs direct memory access (DMA) transfers between GPU memory and storage. +DMA avoids a bounce buffer through the CPU. +This direct path increases system bandwidth and decreases the latency and utilization load on the CPU. + +To support GPUDirect RDMA, userspace CUDA APIs are required. +The kernel mode support is provided by one of two approaches: DMA-BUF from the Linux kernel or the legacy `nvidia-peermem` kernel module. +NVIDIA recommends using DMA-BUF rather than the `nvidia-peermem` kernel module from the GPU driver. + +The Operator uses GDS driver version 2.17.5 or newer. +Version 2.17.5 and later are supported only with the NVIDIA Open GPU Kernel module driver. +In GPU Operator v25.3.0 and later, the `driver.kernelModuleType` default is `auto` for the supported driver versions. +This configuration allows the GPU Operator to choose the recommended driver kernel module type depending on the driver branch and the GPU devices available. +Newer driver versions use the open kernel module by default. To make sure you are using the open kernel module, include the `--set driver.kernelModuleType=open` command-line argument in your Helm install command. + +In conjunction with the Network Operator, the GPU Operator can be used to +set up the networking-related components such as network device kernel drivers and Kubernetes device plugins to enable +workloads to take advantage of GPUDirect RDMA and GPUDirect Storage. +Refer to the Network Operator [documentation](https://docs.nvidia.com/networking/software/cloud-orchestration/index.html) for installation information.
+ +## Common Prerequisites + +The prerequisites for configuring GPUDirect RDMA or GPUDirect Storage depend on whether you use DMA-BUF from the Linux kernel or the legacy `nvidia-peermem` kernel module. + +| Technology | DMA-BUF | Legacy `nvidia-peermem` | +| --- | --- | --- | +| GPU Driver | An Open Kernel module driver is required. | Any supported driver. | +| CUDA | CUDA 11.7 or higher. The CUDA runtime is provided by the driver. | No minimum version. The CUDA runtime is provided by the driver. | +| GPU | Turing architecture data center, Quadro RTX, and RTX GPU or higher. | All data center, Quadro RTX, and RTX GPU or higher. | +| Network Device Drivers | MLNX_OFED or DOCA-OFED are optional. You can use the Linux driver packages from the package manager. | MLNX_OFED or DOCA-OFED are required. | +| Linux Kernel | 5.12 or higher. | No minimum version. | + +* Make sure the network device drivers are installed. + + You can use the [Network Operator](https://docs.nvidia.com/networking/software/cloud-orchestration/index.html) + to manage the driver lifecycle for MLNX_OFED and DOCA-OFED drivers. + + Alternatively, you can install the drivers on each host. + Refer to [Adapter Software](https://docs.nvidia.com/networking/software/adapter-software/index.html) + in the networking documentation for information about the MLNX_OFED, DOCA-OFED, and Linux inbox drivers. + +* For installations on VMware vSphere, refer to the following additional prerequisites: + + * Make sure the network interface controller and the NVIDIA GPU are in the same PCIe IO root complex. + * Enable the following PCI options: + + * `pciPassthru.allowP2P = true` + * `pciPassthru.RelaxACSforP2P = true` + * `pciPassthru.use64bitMMIO = true` + * `pciPassthru.64bitMMIOSizeGB = 128` + + For information about configuring the settings, refer to the + [Deploy an AI-Ready Enterprise Platform on vSphere 7](https://www.vmware.com/docs/deploy-an-ai-ready-enterprise-platform-on-vsphere-7-update-2#vm-settings-A) + document from VMware.
+ +## Related Information + +Refer to the following resources for more information: + + * GPUDirect RDMA: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html + + * NVIDIA Network Operator: https://github.com/Mellanox/network-operator + + * Blog post on deploying the Network Operator: https://developer.nvidia.com/blog/deploying-gpudirect-rdma-on-egx-stack-with-the-network-operator/ diff --git a/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/references/rdma.md b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/references/rdma.md new file mode 100644 index 000000000..7b0348fef --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/references/rdma.md @@ -0,0 +1,356 @@ + + + +# Configuring GPUDirect RDMA + +Throughout, replace `` with your target GPU Operator release. + +## Platform Support + +The following platforms are supported for GPUDirect with RDMA: + +* Kubernetes on bare metal and on vSphere VMs with GPU passthrough and vGPU. +* VMware vSphere with Tanzu. +* For Red Hat OpenShift Container Platform on bare metal and on vSphere VMs with GPU passthrough and vGPU configurations, + refer to NVIDIA AI Enterprise with OpenShift. + +For information about the supported versions, refer to Support for GPUDirect RDMA on the platform support page. + +## Installing the GPU Operator and Enabling GPUDirect RDMA + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). 
+ +To use DMA-BUF and network device drivers that are installed by the Network Operator: + +```console +$ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= +``` + +To use DMA-BUF and network device drivers that are installed on the host: + +```console +$ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set driver.rdma.useHostMofed=true +``` + +To use the legacy `nvidia-peermem` kernel module instead of DMA-BUF, add `--set driver.rdma.enabled=true` to either of the preceding commands. +Add `--set driver.kernelModuleType=open` if you are using a driver version from a branch earlier than R570. + +## Verifying the Installation of GPUDirect with RDMA + +During the installation, the NVIDIA driver daemon set runs an init container that waits for the network device kernel drivers to be ready. +This init container checks for Mellanox NICs on the node and ensures that the necessary kernel symbols are exported by the kernel drivers. + +If you were required to use the `driver.rdma.enabled=true` argument when you installed the Operator, the nvidia-peermem-ctr container is started inside each driver pod after the verification. + +1. Confirm that the pod template for the driver daemon set includes the mofed-validation init container and + the nvidia-driver-ctr containers: + + ```console + $ kubectl describe ds -n gpu-operator nvidia-driver-daemonset + ``` + + *Example Output* + + The following partial output omits the init containers and containers that are common to all installations. + + ```output + ... + Init Containers: + mofed-validation: + Container ID: containerd://5a36c66b43f676df616e25ba7ae0c81aeaa517308f28ec44e474b2f699218de3 + Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.1 + Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:7a70e95fd19c3425cd4394f4b47bbf2119a70bd22d67d72e485b4d730853262c + ...
+ Containers: + nvidia-driver-ctr: + Container ID: containerd://199a760946c55c3d7254fa0ebe6a6557dd231179057d4909e26c0e6aec49ab0f + Image: nvcr.io/nvaie/vgpu-guest-driver:470.63.01-ubuntu20.04 + Image ID: nvcr.io/nvaie/vgpu-guest-driver@sha256:a1b7d2c8e1bad9bb72d257ddfc5cec341e790901e7574ba2c32acaddaaa94625 + ... + nvidia-peermem-ctr: + Container ID: containerd://0742d86f6017bf0c304b549ebd8caad58084a4185a1225b2c9a7f5c4a171054d + Image: nvcr.io/nvaie/vgpu-guest-driver:470.63.01-ubuntu20.04 + Image ID: nvcr.io/nvaie/vgpu-guest-driver@sha256:a1b7d2c8e1bad9bb72d257ddfc5cec341e790901e7574ba2c32acaddaaa94625 + ... + ``` + + The nvidia-peermem-ctr container is present only if you were required to specify the `driver.rdma.enabled=true` argument when you installed the Operator. + +1. Legacy only: Confirm that the nvidia-peermem-ctr container successfully loaded the nvidia-peermem kernel module: + + ```console + $ kubectl logs -n gpu-operator ds/nvidia-driver-daemonset -c nvidia-peermem-ctr + ``` + + Alternatively, run `kubectl logs -n gpu-operator nvidia-driver-daemonset-xxxxx -c nvidia-peermem-ctr` for each pod in the daemonset. + + *Example Output* + + ```output + waiting for mellanox ofed and nvidia drivers to be installed + waiting for mellanox ofed and nvidia drivers to be installed + successfully loaded nvidia-peermem module + ``` + +## Verifying the Installation by Performing a Data Transfer + +You can perform the following steps to verify that GPUDirect with RDMA is configured +correctly and that pods can perform RDMA data transfers. + +1. Get the network interface name of the InfiniBand device on the host: + + ```console + $ kubectl exec -it -n network-operator mofed-ubuntu22.04-ds-xxxxx -- ibdev2netdev + ``` + + *Example Output* + + ```output + mlx5_0 port 1 ==> ens64np1 (Up) + ``` + +1. 
Configure a secondary network on the device using a macvlan network attachment: + + - Create a file, such as `demo-macvlannetwork.yaml`, with contents like the following example: + + ```yaml + apiVersion: mellanox.com/v1alpha1 + kind: MacvlanNetwork + metadata: + name: demo-macvlannetwork + spec: + networkNamespace: "default" + master: "ens64np1" + mode: "bridge" + mtu: 1500 + ipam: | + { + "type": "whereabouts", + "range": "192.168.2.225/28", + "exclude": [ + "192.168.2.229/30", + "192.168.2.236/32" + ] + } + ``` + + Replace `ens64np1` with the network interface name reported by the `ibdev2netdev` command + from the preceding step. + + - Apply the manifest: + + ```console + $ kubectl apply -f demo-macvlannetwork.yaml + ``` + + - Confirm that the additional network is ready: + + ```console + $ kubectl get macvlannetworks demo-macvlannetwork + ``` + + *Example Output* + + ```output + NAME STATUS AGE + demo-macvlannetwork ready 2023-03-10T18:22:28Z + ``` + +1. Start two pods that run the `mellanox/cuda-perftest` container on two different nodes in the cluster. + + ### demo-pod-1 + + - Create a file, such as `demo-pod-1.yaml`, for the first pod with contents like the following: + + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: demo-pod-1 + annotations: + k8s.v1.cni.cncf.io/networks: demo-macvlannetwork + # If a network with static IPAM is used replace network annotation with the below.
+ # k8s.v1.cni.cncf.io/networks: '[ + # { "name": "rdma-net", + # "ips": ["192.168.111.101/24"], + # "gateway": ["192.168.111.1"] + # } + # ]' + spec: + nodeSelector: + # Note: Replace hostname or remove selector altogether + kubernetes.io/hostname: nvnode1 + restartPolicy: OnFailure + containers: + - image: mellanox/cuda-perftest + name: rdma-gpu-test-ctr + securityContext: + capabilities: + add: [ "IPC_LOCK" ] + resources: + limits: + nvidia.com/gpu: 1 + rdma/rdma_shared_device_a: 1 + requests: + nvidia.com/gpu: 1 + rdma/rdma_shared_device_a: 1 + ``` + + - Apply the manifest: + + ```console + $ kubectl apply -f demo-pod-1.yaml + ``` + + ### demo-pod-2 + + - Create a file, such as `demo-pod-2.yaml`, for the second pod with contents like the following: + + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: demo-pod-2 + annotations: + k8s.v1.cni.cncf.io/networks: demo-macvlannetwork + # If a network with static IPAM is used replace network annotation with the below. + # k8s.v1.cni.cncf.io/networks: '[ + # { "name": "rdma-net", + # "ips": ["192.168.111.101/24"], + # "gateway": ["192.168.111.1"] + # } + # ]' + spec: + nodeSelector: + # Note: Replace hostname or remove selector altogether + kubernetes.io/hostname: nvnode2 + restartPolicy: OnFailure + containers: + - image: mellanox/cuda-perftest + name: rdma-gpu-test-ctr + securityContext: + capabilities: + add: [ "IPC_LOCK" ] + resources: + limits: + nvidia.com/gpu: 1 + rdma/rdma_shared_device_a: 1 + requests: + nvidia.com/gpu: 1 + rdma/rdma_shared_device_a: 1 + ``` + + - Apply the manifest: + + ```console + $ kubectl apply -f demo-pod-2.yaml + ``` + +1. Get the IP addresses of the pods: + + ```console + $ kubectl get pods -o wide + ``` + + *Example Output* + + ```output + NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES + demo-pod-1 1/1 Running 0 3d4h 192.168.38.90 nvnode1 + demo-pod-2 1/1 Running 0 3d4h 192.168.47.89 nvnode2 + ``` + +1. 
From one terminal, open a shell in the container on the first pod and start the performance test server: + + ```console + $ kubectl exec -it demo-pod-1 -- ib_write_bw --use_cuda=0 --use_cuda_dmabuf \ + -d mlx5_0 -a -F --report_gbits -q 1 + ``` + + *Example Output* + + ```output + ************************************ + * Waiting for client to connect... * + ************************************ + ``` + +1. From another terminal, open a shell in the container on the second pod and run the performance client: + + ```console + $ kubectl exec -it demo-pod-2 -- ib_write_bw -n 5000 --use_cuda=0 --use_cuda_dmabuf \ + -d mlx5_0 -a -F --report_gbits -q 1 192.168.38.90 + ``` + + *Example Output* + + ```output + --------------------------------------------------------------------------------------- + RDMA_Write BW Test + Dual-port : OFF Device : mlx5_0 + Number of qps : 1 Transport type : IB + Connection type : RC Using SRQ : OFF + PCIe relax order: ON + ibv_wr* API : ON + TX depth : 128 + CQ Moderation : 100 + Mtu : 1024[B] + Link type : Ethernet + GID index : 5 + Max inline data : 0[B] + rdma_cm QPs : OFF + Data ex. 
method : Ethernet + --------------------------------------------------------------------------------------- + local address: LID 0000 QPN 0x01ac PSN 0xc76db1 RKey 0x23beb2 VAddr 0x007f26a2c8b000 + GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:02:226 + remote address: LID 0000 QPN 0x01a9 PSN 0x2f722 RKey 0x23beaf VAddr 0x007f820b24f000 + GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:02:225 + --------------------------------------------------------------------------------------- + #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] + 2 5000 0.11 0.11 6.897101 + 4 5000 0.22 0.22 6.995646 + 8 5000 0.45 0.45 7.014752 + 16 5000 0.90 0.90 7.017509 + 32 5000 1.80 1.80 7.020162 + 64 5000 3.59 3.59 7.007110 + 128 5000 7.19 7.18 7.009540 + 256 5000 15.06 14.98 7.313517 + 512 5000 30.04 29.73 7.259329 + 1024 5000 59.65 58.81 7.178529 + 2048 5000 91.53 91.47 5.582931 + 4096 5000 92.13 92.06 2.809574 + 8192 5000 92.35 92.31 1.408535 + 16384 5000 92.46 92.46 0.705381 + 32768 5000 92.36 92.35 0.352302 + 65536 5000 92.39 92.38 0.176196 + 131072 5000 92.42 92.41 0.088131 + 262144 5000 92.45 92.44 0.044080 + 524288 5000 92.42 92.42 0.022034 + 1048576 5000 92.40 92.40 0.011015 + 2097152 5000 92.40 92.39 0.005507 + 4194304 5000 92.40 92.39 0.002753 + 8388608 5000 92.39 92.39 0.001377 + --------------------------------------------------------------------------------------- + ``` + + The command output indicates that the data transfer rate was approximately 92 Gbps. + +1. Delete the pods: + + ```console + $ kubectl delete -f demo-pod-1.yaml -f demo-pod-2.yaml + ``` + +1. 
Delete the secondary network: + + ```console + $ kubectl delete -f demo-macvlannetwork.yaml + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/references/storage.md b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/references/storage.md new file mode 100644 index 000000000..8d4984f4a --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-gpudirect-rdma/references/storage.md @@ -0,0 +1,107 @@ + + + +# Using GPUDirect Storage + +Throughout, replace `` with your target GPU Operator release. + +## Platform Support + +See Support for GPUDirect Storage on the platform support page. + +## Installing the GPU Operator and Enabling GPUDirect Storage + +This section applies to the following configurations and describes how to deploy the GPU Operator using the Helm chart: + +* Kubernetes on bare metal and on vSphere VMs with GPU passthrough and vGPU. + +Starting with v22.9.1, the GPU Operator provides an option to load the `nvidia-fs` kernel module during the bootstrap of the NVIDIA driver daemon set. +Starting with v23.9.1, the GPU Operator deploys a version of GDS that requires using the NVIDIA Open Kernel module driver. + +The following sample command applies to clusters that use the Network Operator to install the network device kernel drivers. + +```console +$ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set gds.enabled=true +``` + +Add `--set driver.rdma.enabled=true` to the command to use the legacy `nvidia-peermem` kernel module. + +Add `--set driver.kernelModuleType=open` if you are using a driver version from a branch earlier than R570. + +## Verification + +During the installation, an init container is used with the driver daemon set to wait on the network device kernel drivers to be ready. +This init container checks for Mellanox NICs on the node and ensures that the necessary kernel symbols are exported by the kernel drivers.
+After the verification completes, the nvidia-fs-ctr container starts inside the driver pods. + +If you were required to use the `driver.rdma.enabled=true` argument when you installed the Operator, the nvidia-peermem-ctr container is started inside each driver pod after the verification. + +```console +$ kubectl get pod -n gpu-operator +``` + +*Example Output* + +```output +gpu-operator gpu-feature-discovery-pktzg 1/1 Running 0 11m +gpu-operator gpu-operator-1672257888-node-feature-discovery-master-7ccb7txmc 1/1 Running 0 12m +gpu-operator gpu-operator-1672257888-node-feature-discovery-worker-bqhrl 1/1 Running 0 11m +gpu-operator gpu-operator-6f64c86bc-zjqdh 1/1 Running 0 12m +gpu-operator nvidia-container-toolkit-daemonset-rgwqg 1/1 Running 0 11m +gpu-operator nvidia-cuda-validator-8whvt 0/1 Completed 0 8m50s +gpu-operator nvidia-dcgm-exporter-pt9q9 1/1 Running 0 11m +gpu-operator nvidia-device-plugin-daemonset-472fc 1/1 Running 0 11m +gpu-operator nvidia-device-plugin-validator-29nhc 0/1 Completed 0 8m34s +gpu-operator nvidia-driver-daemonset-j9vw6 3/3 Running 0 12m +gpu-operator nvidia-mig-manager-mtjcw 1/1 Running 0 7m35s +gpu-operator nvidia-operator-validator-b8nz2 1/1 Running 0 11m +``` + +```console +$ kubectl describe pod -n gpu-operator nvidia-driver-daemonset-xxxx + + Init Containers: + mofed-validation: + Container ID: containerd://a31a8c16ce7596073fef7cb106da94c452fdff111879e7fc3ec58b9cef83856a + Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1 + Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:18c9ea88ae06d479e6657b8a4126a8ee3f4300a40c16ddc29fb7ab3763d46005 + + + Containers: + nvidia-driver-ctr: + Container ID: containerd://7cf162e4ee4af865c0be2023d61fbbf68c828d396207e7eab2506f9c2a5238a4 + Image: nvcr.io/nvidia/driver:525.60.13-ubuntu20.04 + Image ID: nvcr.io/nvidia/driver@sha256:0ee0c585fa720f177734b3295a073f402d75986c1fe018ae68bd73fe9c21b8d8 + + + nvidia-peermem-ctr: + Container ID: 
containerd://5c71c9f8ccb719728a0503500abecfb5423e8088f474d686ee34b5fe3746c28e + Image: nvcr.io/nvidia/driver:525.60.13-ubuntu20.04 + Image ID: nvcr.io/nvidia/driver@sha256:0ee0c585fa720f177734b3295a073f402d75986c1fe018ae68bd73fe9c21b8d8 + + + nvidia-fs-ctr: + Container ID: containerd://f5c597d59e1cf8747aa20b8c229a6f6edd3ed588b9d24860209ba0cc009c0850 + Image: nvcr.io/nvidia/cloud-native/nvidia-fs:2.14.13-ubuntu20.04 + Image ID: nvcr.io/nvidia/cloud-native/nvidia-fs@sha256:109485365f68caeaee1edee0f3f4d722fe5b5d7071811fc81c630c8a840b847b + + +``` + +Lastly, verify that NVIDIA kernel modules are loaded on the worker node: + +```console +$ lsmod | grep nvidia + +nvidia_fs 245760 0 +nvidia_peermem 16384 0 +nvidia_modeset 1159168 0 +nvidia_uvm 1048576 0 +nvidia 39059456 115 nvidia_uvm,nvidia_modeset +ib_core 319488 9 rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm +drm 491520 6 drm_kms_helper,drm_vram_helper,nvidia,mgag200,ttm +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install/SKILL.md index 7c0ed9107..cf27836fd 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install/SKILL.md @@ -22,551 +22,50 @@ tags: # Installing the NVIDIA GPU Operator -Throughout this skill, replace `` with your target GPU Operator release (for example, the latest patch release listed on the [GPU Operator releases page](https://github.com/NVIDIA/gpu-operator/releases)). +Install the NVIDIA GPU Operator in a Kubernetes cluster with Helm, including +prerequisites, chart customization options, common deployment scenarios, +containerd configuration, and verification with sample GPU workloads. > [!TIP] > For installation on Red Hat OpenShift Container Platform, refer to [OpenShift installation steps](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html). ## Prerequisites -1. 
You have the `kubectl` and `helm` CLIs available on a client machine. +- The `kubectl` and `helm` CLIs available on a client machine. +- Worker nodes configured with a container engine such as CRI-O or containerd. +- If you use ClusterPolicy-managed driver containers, all GPU worker nodes must run the same OS version. With the NVIDIA GPU Driver CRD or pre-installed drivers, mixed OS versions are allowed. +- If the cluster uses Pod Security Admission, label the Operator namespace with `pod-security.kubernetes.io/enforce=privileged`. +- Node Feature Discovery (NFD) is required; the Operator deploys it by default. If NFD is already running, set `nfd.enabled=false` at install. - You can run the following commands to install the Helm CLI: +> Full prerequisite detail (Helm install, PSA labeling, NFD detection) is in [references/prerequisites.md](references/prerequisites.md). - ```console - $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \ - && chmod 700 get_helm.sh \ - && ./get_helm.sh - ``` +## Activation -1. If you are planning to use ClusterPolicy for driver configuration, all worker nodes or node groups to run GPU workloads in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. - Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems. +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All Helm command sequences, the full chart-options table, sample +manifests, and expected output live only in those reference files; do not +improvise commands from this dispatch layer. - For worker nodes or node groups that run CPU workloads only, the nodes can run any operating system because the GPU Operator does not perform any configuration or management of nodes for CPU-only workloads.
+## Phases - If you are planning to use the NVIDIA GPU Driver Custom Resource Definition, you can use a mix of operating system versions on CPU and GPU nodes. Refer to the NVIDIA GPU Driver Custom Resource Definition (use the `gpu-operator-nvidia-driver` skill) page for more information. +| Phase | Summary | Reference | +|-------|---------|-----------| +| Prerequisites | CLI tools, OS-version constraints, container engine, Pod Security Admission labeling, and Node Feature Discovery detection. | [references/prerequisites.md](references/prerequisites.md) | +| Install | Add the NVIDIA Helm repo and install the Operator with the default or a `--set`-customized configuration. | [references/install.md](references/install.md) | +| Chart options | Full table of the most frequently used `--set` Helm chart customization parameters and defaults. | [references/chart-options.md](references/chart-options.md) | +| Deployment scenarios | Namespace selection, excluding operands/driver on some nodes, RHEL, pre-installed drivers and/or toolkit, and custom driver images. | [references/deployment-scenarios.md](references/deployment-scenarios.md) | +| containerd config | `toolkit.env` configuration for containerd, RKE2, and MicroK8s, plus the commercially supported platforms table. | [references/containerd-config.md](references/containerd-config.md) | +| Verification | Run the CUDA VectorAdd and Jupyter Notebook sample workloads to confirm GPU scheduling. | [references/verification.md](references/verification.md) | -1. Nodes must be configured with a container engine such as CRI-O or containerd. +## Hard rules (apply across all phases) -1. If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods, label the namespace for the Operator to set the enforcement policy to privileged: +- Replace `` with your target GPU Operator release (for example, the latest patch release on the [GPU Operator releases page](https://github.com/NVIDIA/gpu-operator/releases)). 
Never hardcode a specific version. +- The Operator and its operands install into the same namespace; choose it at install time (`default` if unspecified). +- If NFD already runs in the cluster, install with `nfd.enabled=false` to avoid a duplicate deployment. - ```console - $ kubectl create ns gpu-operator - $ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged - ``` +## Verification -1. Node Feature Discovery (NFD) is a dependency for the Operator on each node. - By default, NFD master and worker are automatically deployed by the Operator. - If NFD is already running in the cluster, then you must disable deploying NFD when you install the Operator. - - One way to determine if NFD is already running in the cluster is to check for an NFD label on your nodes: - - ```console - $ kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))' - ``` - - If the command output is `true`, then NFD is already running in the cluster. - -## Procedure - -1. Add the NVIDIA Helm repository: - - ```console - $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ - && helm repo update - ``` - -1. Install the GPU Operator. - - - Install the Operator with the default configuration: - - ```console - $ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= - ``` - - - Install the Operator and specify configuration options: - - ```console - $ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set = - ``` - - Refer to the **Common Chart Customization Options** and **Common Deployment Scenarios** sections below for more information. - -## Common Chart Customization Options - -The following options are available when using the Helm chart. -These options can be used with `--set` when installing with Helm. - -The following table identifies the most frequently used options. 
-To view all the options, run `helm show values nvidia/gpu-operator`. - -| Parameter | Description | Default | -| --- | --- | --- | -| `ccManager.enabled` | When set to `true`, the Operator deploys NVIDIA Confidential Computing Manager for Kubernetes. | `false` | -| `cdi.enabled` | When set to `true` (default), the Container Device Interface (CDI) will be used for injecting GPUs into workload containers. The Operator will no longer configure the `nvidia` runtime class as the default runtime handler. Instead, native-CDI support in container runtimes like containerd or cri-o will be leveraged for injecting GPUs into workload containers. Refer to the Container Device Interface page (use the `gpu-operator-container-device` skill) for more information. | `true` | -| `cdi.nriPluginEnabled` | When set to `true`, the Node Resource Interface (NRI) Plugin will be used for injecting GPUs into workload containers. In NRI Plugin mode, the NVIDIA Container Toolkit will no longer modify the runtime config. This feature requires containerd v1.7.30, v2.1.x, or v2.2.x. Refer to the Container Device Interface page (use the `gpu-operator-container-device` skill) for more information. | `false` | -| `cdi.default` Deprecated. | This field is deprecated as of v25.10.0 and will be ignored. The `cdi.enabled` field is set to `true` by default in versions 25.10.0 and later. When set to `true`, the container runtime uses CDI to perform device injection by default. | `false` | -| `daemonsets.annotations` | Map of custom annotations to add to all GPU Operator managed pods. | `{}` | -| `daemonsets.labels` | Map of custom labels to add to all GPU Operator managed pods. | `{}` | -| `dcgmExporter.enabled` | By default, the Operator gathers GPU telemetry in Kubernetes using [DCGM Exporter](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html). Set this value to `false` to disable it. Available values are `true` (default) or `false`. 
| `true` | -| `dcgmExporter.service.internalTrafficPolicy` | Specifies the [internalTrafficPolicy](https://kubernetes.io/docs/concepts/services-networking/service/#traffic-policies) for the DCGM Exporter service. Available values are `Cluster` (default) or `Local`. | `Cluster` | -| `dcgmExporter.hostNetwork` | When set to `true`, the DCGM Exporter will expose a metric port on the host's network namespace. | `false` | -| `devicePlugin.config` | Specifies the configuration for the NVIDIA Device Plugin as a config map. In most cases, this field is configured after installing the Operator, such as to configure GPU time-slicing (use the `gpu-operator-timeslicing-gpus` skill). | `{}` | -| `driver.enabled` | By default, the Operator deploys NVIDIA drivers as a container on the system. Set this value to `false` when using the Operator on systems with pre-installed drivers. | `true` | -| `driver.image` | Name of the NVIDIA Driver Container image to use. | `driver` | -| `driver.imagePullSecrets` | List of the image pull secret used for pulling the driver container image from the registry. | None | -| `driver.kernelModuleType` | Specifies the type of the NVIDIA GPU Kernel modules to use. Valid values are `auto` (default), `proprietary`, and `open`. `Auto` means that the recommended kernel module type (open or proprietary) is chosen based on the GPU devices on the host and the driver branch used. The `auto` option is only supported with the 570.86.15 and 570.124.06 or later driver containers. 550 and 535 branch drivers do not yet support this mode. `Open` means the open kernel module is used. `Proprietary` means the proprietary module is used. | `auto` | -| `driver.nvidiaDriverCRD.enabled` | When set to `true`, the Operator deploys NVIDIA GPU Driver Custom Resource Definition. Refer to the NVIDIA GPU Driver Custom Resource Definition (use the `gpu-operator-nvidia-driver` skill) page for more information. | `false` | -| `driver.repository` | The images are downloaded from NGC. 
Specify another image repository when using custom driver images. | `nvcr.io/nvidia` | -| `driver.rdma.enabled` | Controls whether the driver daemon set builds and loads the legacy `nvidia-peermem` kernel module. You might be able to use GPUDirect RDMA without enabling this option. Refer to the GPUDirect RDMA page (use the `gpu-operator-gpudirect-rdma` skill) for information about whether you can use DMA-BUF or you need to use legacy `nvidia-peermem`. | `false` | -| `driver.rdma.useHostMofed` | Indicate if MLNX_OFED (MOFED) drivers are pre-installed on the host. | `false` | -| `driver.secretEnv` | The name of the secret to the driver container. A common use case is to use this field to pass your Ubuntu Pro token secret if you are deploying the GPU Operator with government-ready components. Refer to the government-ready installation page (use the `gpu-operator-install-governmentready-environments` skill) for more information. | None | -| `driver.startupProbe` | By default, the driver container has an initial delay of `60s` before starting liveness probes. The probe runs the `nvidia-smi` command with a timeout duration of `60s`. You can increase the `timeoutSeconds` duration if the `nvidia-smi` command runs slowly in your cluster. | `60s` | -| `driver.useOpenKernelModules` Deprecated. | This field is deprecated as of v25.3.0 and will be ignored. Use `kernelModuleType` instead. When set to `true`, the driver containers install the NVIDIA Open GPU Kernel module driver. | `false` | -| `driver.usePrecompiled` | When set to `true`, the Operator attempts to deploy driver containers that have precompiled kernel drivers. Refer to the precompiled driver containers (use the `gpu-operator-precompiled-drivers` skill) page for the supported operating systems. | `false` | -| `driver.version` | Version of the NVIDIA datacenter driver supported by the Operator. If you set `driver.usePrecompiled` to `true`, then set this field to a driver branch, such as `525`. 
| Depends on the version of the Operator. Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) for more information on supported drivers. | -| `gdrcopy.enabled` | Enables support for GDRCopy. When set to `true`, the GDRCopy Driver runs as a sidecar container in the GPU driver pod. For information about GDRCopy, refer to the [gdrcopy](https://developer.nvidia.com/gdrcopy) page. You can enable GDRCopy if you use the NVIDIA GPU Driver custom resource (use the `gpu-operator-nvidia-driver` skill). | `false` | -| `mig.strategy` | Controls the strategy to be used with MIG on supported NVIDIA GPUs. Options are either `mixed` or `single`. | `single` | -| `migManager.enabled` | The MIG manager watches for changes to the MIG geometry and applies reconfiguration as needed. By default, the MIG manager only runs on nodes with GPUs that support MIG (such as the A100). | `true` | -| `nfd.enabled` | Deploys Node Feature Discovery plugin as a daemonset. Set this variable to `false` if NFD is already running in the cluster. | `true` | -| `nfd.nodefeaturerules` | Installs node feature rules that are related to confidential computing. NFD uses the rules to detect security features in CPUs and NVIDIA GPUs. Set this variable to `true` when you configure the Operator for Confidential Containers. | `false` | -| `operator.labels` | Map of custom labels that will be added to all GPU Operator managed pods. | `{}` | -| `psp.enabled` | The GPU Operator deploys `PodSecurityPolicies` if enabled. | `false` | -| `sandboxWorkloads.enabled` | Specifies if sandbox containers are enabled. | `false` | -| `sandboxWorkloads.defaultWorkload` | Specifies the default type of workload for the cluster, one of `container`, `vm-passthrough`, or `vm-vgpu`. Setting `vm-passthrough` or `vm-vgpu` can be helpful if you plan to run all or mostly virtual machines in your cluster. 
Refer to KubeVirt (use the `gpu-operator-kubevirt` skill), Kata Containers (use the `gpu-operator-kata-containers` skill) for more details on deploying different workload containers. | `container` | -| `sandboxWorkloads.mode` | Specifies the sandbox mode to use when deploying sandbox workloads. Accepted values are `kubevirt` (default) and `kata`. Refer to the KubeVirt (use the `gpu-operator-kubevirt` skill) or the Kata Containers (use the `gpu-operator-kata-containers` skill) pages for more information on using KubeVirt or Kata based workloads. | `kubevirt` | -| `toolkit.enabled` | By default, the Operator deploys the NVIDIA Container Toolkit (`nvidia-docker2` stack) as a container on the system. Set this value to `false` when using the Operator on systems with pre-installed NVIDIA runtimes. | `true` | - -## Common Deployment Scenarios - -The following common deployment scenarios and sample commands apply best to -bare metal hosts or virtual machines with GPU passthrough. - -### Specifying the Operator Namespace - -Both the Operator and operands are installed in the same namespace. -The namespace is configurable and is specified during installation. -For example, to install the GPU Operator in the `nvidia-gpu-operator` namespace: - -```console -$ helm install --wait --generate-name \ - -n nvidia-gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ -``` - -If you do not specify a namespace during installation, all GPU Operator components are installed in the `default` namespace. - -### Preventing Installation of Operands on Some Nodes - -By default, the GPU Operator operands are deployed on all GPU worker nodes in the cluster. -GPU worker nodes are identified by the presence of the label `feature.node.kubernetes.io/pci-10de.present=true`. -The value `0x10de` is the PCI vendor ID that is assigned to NVIDIA. - -To disable operands from getting deployed on a GPU worker node, label the node with `nvidia.com/gpu.deploy.operands=false`. 
- -```console -$ kubectl label nodes $NODE nvidia.com/gpu.deploy.operands=false -``` - -### Preventing Installation of NVIDIA GPU Driver on Some Nodes - -By default, the GPU Operator deploys the driver on all GPU worker nodes in the cluster. -To prevent installing the driver on a GPU worker node, label the node like the following sample command. - -```console -$ kubectl label nodes $NODE nvidia.com/gpu.deploy.driver=false -``` - -### Installation on Red Hat Enterprise Linux - -In this scenario, use the NVIDIA Container Toolkit image that is built on UBI 8: - -```console -$ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set toolkit.version=v1.16.1-ubi8 -``` - -Replace the `v1.16.1` value in the preceding command with the version that is supported -with the NVIDIA GPU Operator. -Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) on the platform support page. - -When using RHEL8 with Kubernetes, SELinux must be enabled either in permissive or enforcing mode for use with the GPU Operator. -Additionally, when using RHEL8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, by setting the `enable_selinux=true` configuration option. -Network restricted environments are not supported. - -### Pre-Installed NVIDIA GPU Drivers - -In this scenario, the NVIDIA GPU driver is already installed on the worker nodes that have GPUs: - -```console -$ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set driver.enabled=false -``` - -The preceding command prevents the Operator from installing the GPU driver on any nodes in the cluster. 
- -If you do not specify the `driver.enabled=false` argument and nodes in the cluster have a pre-installed GPU driver, the init container in the driver pod detects that the driver is preinstalled and labels the node so that the driver pod is terminated and does not get re-scheduled on to the node. -The Operator proceeds to start other pods, such as the container toolkit pod. - -### Pre-Installed NVIDIA GPU Drivers and NVIDIA Container Toolkit - -In this scenario, the NVIDIA GPU driver and the NVIDIA Container Toolkit are already installed on -the worker nodes that have GPUs. - -> [!TIP] -> This scenario applies to NVIDIA DGX Systems that run NVIDIA Base OS. -> Before installing the Operator, ensure that the default runtime is set to `nvidia`. -> Refer to the [NVIDIA Container Toolkit configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) documentation for more information. - -Install the Operator with the following options: - -```console -$ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set driver.enabled=false \ - --set toolkit.enabled=false -``` - -### Pre-Installed NVIDIA Container Toolkit (but no drivers) - -In this scenario, the NVIDIA Container Toolkit is already installed on the worker nodes that have GPUs. - -1. Configure toolkit to use the `root` directory of the driver installation as `/run/nvidia/driver`, because this is the path mounted by driver container. - - ```console - $ sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml - ``` - -1. 
Install the Operator with the following options (which will provision a driver): - - ```console - $ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set toolkit.enabled=false - ``` - -### Running a Custom Driver Image - -If you want to use custom driver container images, such as version 465.27, then -you can build a custom driver container image. Follow these steps: - -- Rebuild the driver container by specifying the `$DRIVER_VERSION` argument when building the Docker image. For - reference, the driver container Dockerfiles are available on the Git repository at https://github.com/NVIDIA/gpu-driver-container/. -- Build the container using the appropriate Dockerfile. For example: - - ```console - $ docker build --pull -t \ - --build-arg DRIVER_VERSION=455.28 \ - nvidia/driver:455.28-ubuntu20.04 \ - --file Dockerfile . - ``` - - Ensure that the driver container is tagged as shown in the example by using the `driver:-` schema. -- Specify the new driver image and repository by overriding the defaults in - the Helm install command. For example: - - ```console - $ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set driver.repository=docker.io/nvidia \ - --set driver.version="465.27" - ``` - -These instructions are provided for reference and evaluation purposes. -Not using the standard releases of the GPU Operator from NVIDIA would mean limited -support for such custom configurations. - -## Specifying Configuration Options for containerd - -> [!NOTE] -> It's recommended that you enable the NRI Plugin to configure the container runtime by setting `cdi.nriPluginEnabled=true`. -> When enabled, you do not need to specify the `toolkit.env` options and injecting GPUs into workload containers is handled by the NRI Plugin. 
-> Refer to the Container Device Interface and NRI page (use the `gpu-operator-container-device` skill) for more information. -> When you use containerd as the container runtime, the following configuration -> options are used with the container-toolkit deployed with GPU Operator: - -```yaml -toolkit: - env: - - name: CONTAINERD_CONFIG - value: /etc/containerd/config.toml - - name: CONTAINERD_SOCKET - value: /run/containerd/containerd.sock - - name: RUNTIME_CONFIG_SOURCE - value: "command,file" -``` - -If you need to specify custom values, refer to the following sample command for the syntax: - -```console -helm install gpu-operator -n gpu-operator --create-namespace \ - nvidia/gpu-operator $HELM_OPTIONS \ - --version= \ - --set toolkit.env[0].name=CONTAINERD_CONFIG \ - --set toolkit.env[0].value=/etc/containerd/containerd.toml \ - --set toolkit.env[1].name=CONTAINERD_SOCKET \ - --set toolkit.env[1].value=/run/containerd/containerd.sock \ - --set toolkit.env[2].name=RUNTIME_CONFIG_SOURCE \ - --set toolkit.env[2].value="command,file" -``` - -These options are defined as follows: - -CONTAINERD_CONFIG - The path on the host to the top-level `containerd` config file. - By default this will point to `/etc/containerd/containerd.toml` - (the default location for `containerd`). It should be customized if your `containerd` - installation is not in the default location. - -CONTAINERD_SOCKET - The path on the host to the socket file used to - communicate with `containerd`. The operator will use this to send a - `SIGHUP` signal to the `containerd` daemon to reload its config. By - default this will point to `/run/containerd/containerd.sock` - (the default location for `containerd`). It should be customized if - your `containerd` installation is not in the default location. - -RUNTIME_CONFIG_SOURCE - The config source(s) that the container-toolkit uses when fetching - the current containerd configuration. A valid value for this setting is any - combination of [command | file]. 
By default this will be configured as - "command,file" which means the container-toolkit will attempt to fetch - the configuration using the containerd CLI before falling back to reading the - config from the top-level `containerd` config file (configured using - CONTAINERD_CONFIG). When `file` is specified, the absolute path to the file - to be used as a config source can be specified as `file=/path/to/source/config.toml` - -RUNTIME_DROP_IN_CONFIG - The path on the host where the NVIDIA-specific drop-in config file - will be created. By default this will point to `/etc/containerd/conf.d/99-nvidia.toml`. - -### Rancher Kubernetes Engine 2 - -For Rancher Kubernetes Engine 2 (RKE2), refer to -[Deploy NVIDIA Operator](https://docs.rke2.io/add-ons/gpu_operators#deploy-nvidia-operator) -in the RKE2 documentation. - -It's recommended that you enable CDI (default) and the NRI Plugin on RKE. -With both features enabled, you do not need to set `runtimeClassName: nvidia` in your pod spec. - -Refer to the [v24.9.0 known limitations](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/release-notes.html) in the release notes. - -### MicroK8s - -For MicroK8s, set the following in the `ClusterPolicy`. - -```yaml -toolkit: - env: - - name: CONTAINERD_CONFIG - value: /var/snap/microk8s/current/args/containerd-template.toml - - name: CONTAINERD_SOCKET - value: /var/snap/microk8s/common/run/containerd.sock - - name: RUNTIME_CONFIG_SOURCE - value: "file=/var/snap/microk8s/current/args/containerd.toml" -``` - -These options can be passed to GPU Operator during install time as below. 
- -```console -helm install gpu-operator -n gpu-operator --create-namespace \ - nvidia/gpu-operator $HELM_OPTIONS \ - --version= \ - --set toolkit.env[0].name=CONTAINERD_CONFIG \ - --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \ - --set toolkit.env[1].name=CONTAINERD_SOCKET \ - --set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \ - --set toolkit.env[2].name=RUNTIME_CONFIG_SOURCE \ - --set-string toolkit.env[2].value=file=/var/snap/microk8s/current/args/containerd.toml -``` - -## Verification: Running Sample GPU Applications - -### CUDA VectorAdd - -In the first example, let's run a simple CUDA sample, which adds two vectors together: - -1. Create a file, such as `cuda-vectoradd.yaml`, with contents like the following: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: cuda-vectoradd - spec: - restartPolicy: OnFailure - containers: - - name: cuda-vectoradd - image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04" - resources: - limits: - nvidia.com/gpu: 1 - ``` - -1. Run the pod: - - ```console - $ kubectl apply -f cuda-vectoradd.yaml - ``` - - The pod starts, runs the `vectorAdd` command, and then exits. - -1. View the logs from the container: - - ```console - $ kubectl logs pod/cuda-vectoradd - ``` - - *Example Output* - - ```output - [Vector addition of 50000 elements] - Copy input data from the host memory to the CUDA device - CUDA kernel launch with 196 blocks of 256 threads - Copy output data from the CUDA device to the host memory - Test PASSED - Done - ``` - -1. Remove the stopped pod: - - ```console - $ kubectl delete -f cuda-vectoradd.yaml - ``` - - *Example Output* - - ```output - pod "cuda-vectoradd" deleted - ``` - -### Jupyter Notebook - -You can perform the following steps to deploy Jupyter Notebook in your cluster: - -1. 
Create a file, such as `tf-notebook.yaml`, with contents like the following example: - - ```yaml - --- - apiVersion: v1 - kind: Service - metadata: - name: tf-notebook - labels: - app: tf-notebook - spec: - type: NodePort - ports: - - port: 80 - name: http - targetPort: 8888 - nodePort: 30001 - selector: - app: tf-notebook - --- - apiVersion: v1 - kind: Pod - metadata: - name: tf-notebook - labels: - app: tf-notebook - spec: - securityContext: - fsGroup: 0 - containers: - - name: tf-notebook - image: tensorflow/tensorflow:latest-gpu-jupyter - resources: - limits: - nvidia.com/gpu: 1 - ports: - - containerPort: 8888 - name: notebook - ``` - -1. Apply the manifest to deploy the pod and start the service: - - ```console - $ kubectl apply -f tf-notebook.yaml - ``` - -1. Check the pod status: - - ```console - $ kubectl get pod tf-notebook - ``` - - *Example Output* - - ```output - NAMESPACE NAME READY STATUS RESTARTS AGE - default tf-notebook 1/1 Running 0 3m45s - ``` - -1. Because the manifest includes a service, get the external port for the notebook: - - ```console - $ kubectl get svc tf-notebook - ``` - - *Example Output* - - ```output - NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE - tf-notebook NodePort 10.106.229.20 80:30001/TCP 4m41s - ``` - -1. 
Get the token for the Jupyter notebook: - - ```console - $ kubectl logs tf-notebook - ``` - - *Example Output* - - ```output - [I 21:50:23.188 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret - [I 21:50:23.390 NotebookApp] Serving notebooks from local directory: /tf - [I 21:50:23.391 NotebookApp] The Jupyter Notebook is running at: - [I 21:50:23.391 NotebookApp] http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9 - [I 21:50:23.391 NotebookApp] or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9 - [I 21:50:23.391 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). - [C 21:50:23.394 NotebookApp] - - To access the notebook, open this file in a browser: - file:///root/.local/share/jupyter/runtime/nbserver-1-open.html - Or copy and paste one of these URLs: - http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9 - or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9 - ``` - -The notebook should now be accessible from your browser at this URL: -[http://your-machine-ip:30001/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9](http://your-machine-ip:30001/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9). 
- -## Installation on Commercially Supported Kubernetes Platforms - -| Product | Documentation | -| --- | --- | -| Red Hat OpenShift 4 using RHCOS worker nodes | [NVIDIA GPU Operator on Red Hat OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html) | -| VMware vSphere Kubernetes Service and NVIDIA AI Enterprise | [NVIDIA AI Enterprise VMware vSphere Deployment Guide](https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/index.html) | -| Google Cloud Anthos | [Google Cloud Anthos guide](https://docs.nvidia.com/datacenter/cloud-native/edge/latest/anthos-guide.html) | +After install, run a sample GPU workload (CUDA VectorAdd) and confirm +`Test PASSED`. Exact manifests and commands are in +[references/verification.md](references/verification.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-install/references/chart-options.md b/gpu-operator/.agents/skills/gpu-operator-install/references/chart-options.md new file mode 100644 index 000000000..9c6d036be --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install/references/chart-options.md @@ -0,0 +1,47 @@ + + + +# Common Chart Customization Options + +The following options are available when using the Helm chart. +These options can be used with `--set` when installing with Helm. + +The following table identifies the most frequently used options. +To view all the options, run `helm show values nvidia/gpu-operator`. + +| Parameter | Description | Default | +| --- | --- | --- | +| `ccManager.enabled` | When set to `true`, the Operator deploys NVIDIA Confidential Computing Manager for Kubernetes. | `false` | +| `cdi.enabled` | When set to `true` (default), the Container Device Interface (CDI) will be used for injecting GPUs into workload containers. The Operator will no longer configure the `nvidia` runtime class as the default runtime handler. 
Instead, native-CDI support in container runtimes like containerd or CRI-O will be leveraged for injecting GPUs into workload containers. Refer to the Container Device Interface page (use the `gpu-operator-container-device` skill) for more information. | `true` | +| `cdi.nriPluginEnabled` | When set to `true`, the Node Resource Interface (NRI) Plugin will be used for injecting GPUs into workload containers. In NRI Plugin mode, the NVIDIA Container Toolkit will no longer modify the runtime config. This feature requires containerd v1.7.30, v2.1.x, or v2.2.x. Refer to the Container Device Interface page (use the `gpu-operator-container-device` skill) for more information. | `false` | +| `cdi.default` Deprecated. | This field is deprecated as of v25.10.0 and will be ignored. The `cdi.enabled` field is set to `true` by default in versions 25.10.0 and later. When set to `true`, the container runtime uses CDI to perform device injection by default. | `false` | +| `daemonsets.annotations` | Map of custom annotations to add to all GPU Operator-managed pods. | `{}` | +| `daemonsets.labels` | Map of custom labels to add to all GPU Operator-managed pods. | `{}` | +| `dcgmExporter.enabled` | By default, the Operator gathers GPU telemetry in Kubernetes using [DCGM Exporter](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html). Set this value to `false` to disable it. Available values are `true` (default) or `false`. | `true` | +| `dcgmExporter.service.internalTrafficPolicy` | Specifies the [internalTrafficPolicy](https://kubernetes.io/docs/concepts/services-networking/service/#traffic-policies) for the DCGM Exporter service. Available values are `Cluster` (default) or `Local`. | `Cluster` | +| `dcgmExporter.hostNetwork` | When set to `true`, the DCGM Exporter will expose a metrics port on the host's network namespace. | `false` | +| `devicePlugin.config` | Specifies the configuration for the NVIDIA Device Plugin as a config map.
In most cases, this field is configured after installing the Operator, such as to configure GPU time-slicing (use the `gpu-operator-timeslicing-gpus` skill). | `{}` | +| `driver.enabled` | By default, the Operator deploys NVIDIA drivers as a container on the system. Set this value to `false` when using the Operator on systems with pre-installed drivers. | `true` | +| `driver.image` | Name of the NVIDIA Driver Container image to use. | `driver` | +| `driver.imagePullSecrets` | List of the image pull secret used for pulling the driver container image from the registry. | None | +| `driver.kernelModuleType` | Specifies the type of the NVIDIA GPU Kernel modules to use. Valid values are `auto` (default), `proprietary`, and `open`. `Auto` means that the recommended kernel module type (open or proprietary) is chosen based on the GPU devices on the host and the driver branch used. The `auto` option is only supported with the 570.86.15 and 570.124.06 or later driver containers. 550 and 535 branch drivers do not yet support this mode. `Open` means the open kernel module is used. `Proprietary` means the proprietary module is used. | `auto` | +| `driver.nvidiaDriverCRD.enabled` | When set to `true`, the Operator deploys NVIDIA GPU Driver Custom Resource Definition. Refer to the NVIDIA GPU Driver Custom Resource Definition (use the `gpu-operator-nvidia-driver` skill) page for more information. | `false` | +| `driver.repository` | The images are downloaded from NGC. Specify another image repository when using custom driver images. | `nvcr.io/nvidia` | +| `driver.rdma.enabled` | Controls whether the driver daemon set builds and loads the legacy `nvidia-peermem` kernel module. You might be able to use GPUDirect RDMA without enabling this option. Refer to the GPUDirect RDMA page (use the `gpu-operator-gpudirect-rdma` skill) for information about whether you can use DMA-BUF or you need to use legacy `nvidia-peermem`. 
| `false` | +| `driver.rdma.useHostMofed` | Indicate if MLNX_OFED (MOFED) drivers are pre-installed on the host. | `false` | +| `driver.secretEnv` | The name of the secret to the driver container. A common use case is to use this field to pass your Ubuntu Pro token secret if you are deploying the GPU Operator with government-ready components. Refer to the government-ready installation page (use the `gpu-operator-install-governmentready-environments` skill) for more information. | None | +| `driver.startupProbe` | By default, the driver container has an initial delay of `60s` before starting liveness probes. The probe runs the `nvidia-smi` command with a timeout duration of `60s`. You can increase the `timeoutSeconds` duration if the `nvidia-smi` command runs slowly in your cluster. | `60s` | +| `driver.useOpenKernelModules` Deprecated. | This field is deprecated as of v25.3.0 and will be ignored. Use `kernelModuleType` instead. When set to `true`, the driver containers install the NVIDIA Open GPU Kernel module driver. | `false` | +| `driver.usePrecompiled` | When set to `true`, the Operator attempts to deploy driver containers that have precompiled kernel drivers. Refer to the precompiled driver containers (use the `gpu-operator-precompiled-drivers` skill) page for the supported operating systems. | `false` | +| `driver.version` | Version of the NVIDIA datacenter driver supported by the Operator. If you set `driver.usePrecompiled` to `true`, then set this field to a driver branch, such as `525`. | Depends on the version of the Operator. Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) for more information on supported drivers. | +| `gdrcopy.enabled` | Enables support for GDRCopy. When set to `true`, the GDRCopy Driver runs as a sidecar container in the GPU driver pod. 
For information about GDRCopy, refer to the [gdrcopy](https://developer.nvidia.com/gdrcopy) page. You can enable GDRCopy if you use the NVIDIA GPU Driver custom resource (use the `gpu-operator-nvidia-driver` skill). | `false` | +| `mig.strategy` | Controls the strategy to be used with MIG on supported NVIDIA GPUs. Options are either `mixed` or `single`. | `single` | +| `migManager.enabled` | The MIG manager watches for changes to the MIG geometry and applies reconfiguration as needed. By default, the MIG manager only runs on nodes with GPUs that support MIG (such as the A100). | `true` | +| `nfd.enabled` | Deploys Node Feature Discovery plugin as a daemonset. Set this variable to `false` if NFD is already running in the cluster. | `true` | +| `nfd.nodefeaturerules` | Installs node feature rules that are related to confidential computing. NFD uses the rules to detect security features in CPUs and NVIDIA GPUs. Set this variable to `true` when you configure the Operator for Confidential Containers. | `false` | +| `operator.labels` | Map of custom labels that will be added to all GPU Operator managed pods. | `{}` | +| `psp.enabled` | The GPU Operator deploys `PodSecurityPolicies` if enabled. | `false` | +| `sandboxWorkloads.enabled` | Specifies if sandbox containers are enabled. | `false` | +| `sandboxWorkloads.defaultWorkload` | Specifies the default type of workload for the cluster, one of `container`, `vm-passthrough`, or `vm-vgpu`. Setting `vm-passthrough` or `vm-vgpu` can be helpful if you plan to run all or mostly virtual machines in your cluster. Refer to KubeVirt (use the `gpu-operator-kubevirt` skill), Kata Containers (use the `gpu-operator-kata-containers` skill) for more details on deploying different workload containers. | `container` | +| `sandboxWorkloads.mode` | Specifies the sandbox mode to use when deploying sandbox workloads. Accepted values are `kubevirt` (default) and `kata`. 
Refer to the KubeVirt (use the `gpu-operator-kubevirt` skill) or the Kata Containers (use the `gpu-operator-kata-containers` skill) pages for more information on using KubeVirt or Kata based workloads. | `kubevirt` | +| `toolkit.enabled` | By default, the Operator deploys the NVIDIA Container Toolkit (`nvidia-docker2` stack) as a container on the system. Set this value to `false` when using the Operator on systems with pre-installed NVIDIA runtimes. | `true` | diff --git a/gpu-operator/.agents/skills/gpu-operator-install/references/containerd-config.md b/gpu-operator/.agents/skills/gpu-operator-install/references/containerd-config.md new file mode 100644 index 000000000..77532431e --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install/references/containerd-config.md @@ -0,0 +1,116 @@ + + + +# Specifying Configuration Options for containerd + +Throughout, replace `` with your target GPU Operator release. + +> [!NOTE] +> It's recommended that you enable the NRI Plugin to configure the container runtime by setting `cdi.nriPluginEnabled=true`. +> When enabled, you do not need to specify the `toolkit.env` options and injecting GPUs into workload containers is handled by the NRI Plugin. +> Refer to the Container Device Interface and NRI page (use the `gpu-operator-container-device` skill) for more information. 
+ +When you use containerd as the container runtime, the following configuration +options are used with the container-toolkit deployed with GPU Operator: + +```yaml +toolkit: + env: + - name: CONTAINERD_CONFIG + value: /etc/containerd/config.toml + - name: CONTAINERD_SOCKET + value: /run/containerd/containerd.sock + - name: RUNTIME_CONFIG_SOURCE + value: "command,file" +``` + +If you need to specify custom values, refer to the following sample command for the syntax: + +```console +helm install gpu-operator -n gpu-operator --create-namespace \ + nvidia/gpu-operator $HELM_OPTIONS \ + --version= \ + --set toolkit.env[0].name=CONTAINERD_CONFIG \ + --set toolkit.env[0].value=/etc/containerd/containerd.toml \ + --set toolkit.env[1].name=CONTAINERD_SOCKET \ + --set toolkit.env[1].value=/run/containerd/containerd.sock \ + --set toolkit.env[2].name=RUNTIME_CONFIG_SOURCE \ + --set toolkit.env[2].value="command\,file" +``` + +These options are defined as follows: + +CONTAINERD_CONFIG + The path on the host to the top-level `containerd` config file. + By default this points to `/etc/containerd/config.toml` + (the default location for `containerd`). Customize it if your `containerd` + installation is not in the default location. + +CONTAINERD_SOCKET + The path on the host to the socket file used to + communicate with `containerd`. The operator uses this to send a + `SIGHUP` signal to the `containerd` daemon to reload its config. By + default this points to `/run/containerd/containerd.sock` + (the default location for `containerd`). Customize it if + your `containerd` installation is not in the default location. + +RUNTIME_CONFIG_SOURCE + The config source(s) that the container-toolkit uses when fetching + the current containerd configuration. A valid value for this setting is any + combination of [command | file].
By default this is configured as + "command,file", which means the container-toolkit attempts to fetch + the configuration using the containerd CLI before falling back to reading the + config from the top-level `containerd` config file (configured using + CONTAINERD_CONFIG). When `file` is specified, the absolute path to the file + to be used as a config source can be specified as `file=/path/to/source/config.toml`. + +RUNTIME_DROP_IN_CONFIG + The path on the host where the NVIDIA-specific drop-in config file + will be created. By default this points to `/etc/containerd/conf.d/99-nvidia.toml`. + +## Rancher Kubernetes Engine 2 + +For Rancher Kubernetes Engine 2 (RKE2), refer to +[Deploy NVIDIA Operator](https://docs.rke2.io/add-ons/gpu_operators#deploy-nvidia-operator) +in the RKE2 documentation. + +It's recommended that you enable CDI (default) and the NRI Plugin on RKE2. +With both features enabled, you do not need to set `runtimeClassName: nvidia` in your pod spec. + +Refer to the [v24.9.0 known limitations](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/release-notes.html) in the release notes. + +## MicroK8s + +For MicroK8s, set the following in the `ClusterPolicy`: + +```yaml +toolkit: + env: + - name: CONTAINERD_CONFIG + value: /var/snap/microk8s/current/args/containerd-template.toml + - name: CONTAINERD_SOCKET + value: /var/snap/microk8s/common/run/containerd.sock + - name: RUNTIME_CONFIG_SOURCE + value: "file=/var/snap/microk8s/current/args/containerd.toml" +``` + +You can pass these options to the GPU Operator at install time, as shown in the following command:
+ +```console +helm install gpu-operator -n gpu-operator --create-namespace \ + nvidia/gpu-operator $HELM_OPTIONS \ + --version= \ + --set toolkit.env[0].name=CONTAINERD_CONFIG \ + --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \ + --set toolkit.env[1].name=CONTAINERD_SOCKET \ + --set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \ + --set toolkit.env[2].name=RUNTIME_CONFIG_SOURCE \ + --set-string toolkit.env[2].value=file=/var/snap/microk8s/current/args/containerd.toml +``` + +## Installation on Commercially Supported Kubernetes Platforms + +| Product | Documentation | +| --- | --- | +| Red Hat OpenShift 4 using RHCOS worker nodes | [NVIDIA GPU Operator on Red Hat OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html) | +| VMware vSphere Kubernetes Service and NVIDIA AI Enterprise | [NVIDIA AI Enterprise VMware vSphere Deployment Guide](https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/index.html) | +| Google Cloud Anthos | [Google Cloud Anthos guide](https://docs.nvidia.com/datacenter/cloud-native/edge/latest/anthos-guide.html) | diff --git a/gpu-operator/.agents/skills/gpu-operator-install/references/deployment-scenarios.md b/gpu-operator/.agents/skills/gpu-operator-install/references/deployment-scenarios.md new file mode 100644 index 000000000..f2e8664c3 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install/references/deployment-scenarios.md @@ -0,0 +1,156 @@ + + + +# Common Deployment Scenarios + +The following common deployment scenarios and sample commands apply best to +bare metal hosts or virtual machines with GPU passthrough. + +Throughout, replace `` with your target GPU Operator release. + +## Specifying the Operator Namespace + +Both the Operator and operands are installed in the same namespace. +The namespace is configurable and is specified during installation. 
+For example, to install the GPU Operator in the `nvidia-gpu-operator` namespace: + +```console +$ helm install --wait --generate-name \ + -n nvidia-gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= +``` + +If you do not specify a namespace during installation, all GPU Operator components are installed in the `default` namespace. + +## Preventing Installation of Operands on Some Nodes + +By default, the GPU Operator operands are deployed on all GPU worker nodes in the cluster. +GPU worker nodes are identified by the presence of the label `feature.node.kubernetes.io/pci-10de.present=true`. +The value `0x10de` is the PCI vendor ID that is assigned to NVIDIA. + +To prevent operands from being deployed on a GPU worker node, label the node with `nvidia.com/gpu.deploy.operands=false`. + +```console +$ kubectl label nodes $NODE nvidia.com/gpu.deploy.operands=false +``` + +## Preventing Installation of NVIDIA GPU Driver on Some Nodes + +By default, the GPU Operator deploys the driver on all GPU worker nodes in the cluster. +To prevent installing the driver on a GPU worker node, label the node as shown in the following sample command. + +```console +$ kubectl label nodes $NODE nvidia.com/gpu.deploy.driver=false +``` + +## Installation on Red Hat Enterprise Linux + +In this scenario, use the NVIDIA Container Toolkit image that is built on UBI 8: + +```console +$ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set toolkit.version=v1.16.1-ubi8 +``` + +Replace the `v1.16.1` value in the preceding command with the version that is supported +with the NVIDIA GPU Operator. +Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) on the platform support page.
+ +When using RHEL8 with Kubernetes, SELinux must be enabled in either permissive or enforcing mode for use with the GPU Operator. +Additionally, when you use RHEL8 with containerd as the runtime and SELinux is enabled (in either permissive or enforcing mode) at the host level, you must also configure containerd for SELinux by setting the `enable_selinux=true` configuration option. +Network-restricted environments are not supported. + +## Pre-Installed NVIDIA GPU Drivers + +In this scenario, the NVIDIA GPU driver is already installed on the worker nodes that have GPUs: + +```console +$ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set driver.enabled=false +``` + +The preceding command prevents the Operator from installing the GPU driver on any nodes in the cluster. + +If you do not specify the `driver.enabled=false` argument and nodes in the cluster have a pre-installed GPU driver, the init container in the driver pod detects that the driver is pre-installed and labels the node so that the driver pod is terminated and is not rescheduled onto the node. +The Operator proceeds to start other pods, such as the container toolkit pod. + +## Pre-Installed NVIDIA GPU Drivers and NVIDIA Container Toolkit + +In this scenario, the NVIDIA GPU driver and the NVIDIA Container Toolkit are already installed on +the worker nodes that have GPUs. + +> [!TIP] +> This scenario applies to NVIDIA DGX Systems that run NVIDIA Base OS. +> Before installing the Operator, ensure that the default runtime is set to `nvidia`. +> Refer to the [NVIDIA Container Toolkit configuration](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) documentation for more information.
+ +Install the Operator with the following options: + +```console +$ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set driver.enabled=false \ + --set toolkit.enabled=false +``` + +## Pre-Installed NVIDIA Container Toolkit (but no drivers) + +In this scenario, the NVIDIA Container Toolkit is already installed on the worker nodes that have GPUs. + +1. Configure the toolkit to use the `root` directory of the driver installation as `/run/nvidia/driver`, because this is the path mounted by the driver container. + + ```console + $ sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml + ``` + +1. Install the Operator with the following options (which will provision a driver): + + ```console + $ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set toolkit.enabled=false + ``` + +## Running a Custom Driver Image + +To run a custom driver version, such as 465.27, +build a custom driver container image. Follow these steps: + +- Rebuild the driver container by specifying the `$DRIVER_VERSION` argument when building the Docker image. For + reference, the driver container Dockerfiles are available on the Git repository at https://github.com/NVIDIA/gpu-driver-container/. +- Build the container using the appropriate Dockerfile. For example: + + ```console + $ docker build --pull \ + --build-arg DRIVER_VERSION=465.27 \ + -t nvidia/driver:465.27-ubuntu20.04 \ + --file Dockerfile . + ``` + + Ensure that the driver container is tagged as shown in the example by using the `driver:-` schema. +- Specify the new driver image and repository by overriding the defaults in + the Helm install command.
For example: + + ```console + $ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set driver.repository=docker.io/nvidia \ + --set driver.version="465.27" + ``` + +These instructions are provided for reference and evaluation purposes. +Not using the standard releases of the GPU Operator from NVIDIA would mean limited +support for such custom configurations. diff --git a/gpu-operator/.agents/skills/gpu-operator-install/references/install.md b/gpu-operator/.agents/skills/gpu-operator-install/references/install.md new file mode 100644 index 000000000..749f10014 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install/references/install.md @@ -0,0 +1,39 @@ + + + +# Install Procedure + +Throughout, replace `` with your target GPU Operator release (for example, the latest patch release listed on the [GPU Operator releases page](https://github.com/NVIDIA/gpu-operator/releases)). + +> [!TIP] +> For installation on Red Hat OpenShift Container Platform, refer to [OpenShift installation steps](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html). + +1. Add the NVIDIA Helm repository: + + ```console + $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ + && helm repo update + ``` + +1. Install the GPU Operator. + + - Install the Operator with the default configuration: + + ```console + $ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= + ``` + + - Install the Operator and specify configuration options: + + ```console + $ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set = + ``` + + Refer to the chart customization options (see [references/chart-options.md](chart-options.md)) and common deployment scenarios (see [references/deployment-scenarios.md](deployment-scenarios.md)) for more information. 
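When you combine several `--set` options, it can help to assemble the command and review it before running it. The following is a minimal sketch, not part of the GPU Operator tooling: `print_install_cmd` is a hypothetical helper name, and the `v26.3.1` release and the two chart options are illustrative values taken from elsewhere in this skill. The helper only prints the command and does not run `helm`.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not an NVIDIA tool):
# prints the `helm install` command for a release and a list of chart options
# so that it can be reviewed before it is run.
# Usage: print_install_cmd <version> [key=value ...]
print_install_cmd() {
  version="$1"
  shift
  cmd="helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=${version}"
  # Append one --set flag per chart option, in the order given.
  for opt in "$@"; do
    cmd="${cmd} --set ${opt}"
  done
  printf '%s\n' "$cmd"
}

# Example: nodes with a pre-installed driver and container toolkit.
print_install_cmd v26.3.1 driver.enabled=false toolkit.enabled=false
```

Printing instead of executing keeps the sketch safe to run on a machine without cluster access. Option values that contain commas still need the backslash escaping that Helm requires for `--set`.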
diff --git a/gpu-operator/.agents/skills/gpu-operator-install/references/prerequisites.md b/gpu-operator/.agents/skills/gpu-operator-install/references/prerequisites.md new file mode 100644 index 000000000..a608c2e16 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install/references/prerequisites.md @@ -0,0 +1,42 @@ + + + +# GPU Operator Install Prerequisites + +1. You have the `kubectl` and `helm` CLIs available on a client machine. + + You can run the following commands to install the Helm CLI: + + ```console + $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \ + && chmod 700 get_helm.sh \ + && ./get_helm.sh + ``` + +1. If you are planning to use ClusterPolicy for driver configuration, all worker nodes or node groups to run GPU workloads in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. + Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems. + + For worker nodes or node groups that run CPU workloads only, the nodes can run any operating system because the GPU Operator does not perform any configuration or management of nodes for CPU-only workloads. + + If you are planning to use the NVIDIA GPU Driver Custom Resource Definition, you can use a mix of operating system versions on CPU and GPU nodes. Refer to the NVIDIA GPU Driver Custom Resource Definition (use the `gpu-operator-nvidia-driver` skill) page for more information. + +1. Nodes must be configured with a container engine such as CRI-O or containerd. + +1. If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods, label the namespace for the Operator to set the enforcement policy to privileged: + + ```console + $ kubectl create ns gpu-operator + $ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged + ``` + +1. 
Node Feature Discovery (NFD) is a dependency for the Operator on each node. + By default, NFD master and worker are automatically deployed by the Operator. + If NFD is already running in the cluster, then you must disable deploying NFD when you install the Operator. + + One way to determine if NFD is already running in the cluster is to check for an NFD label on your nodes: + + ```console + $ kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))' + ``` + + If the command output is `true`, then NFD is already running in the cluster. diff --git a/gpu-operator/.agents/skills/gpu-operator-install/references/verification.md b/gpu-operator/.agents/skills/gpu-operator-install/references/verification.md new file mode 100644 index 000000000..4ba2c08f9 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install/references/verification.md @@ -0,0 +1,165 @@ + + + +# Verification: Running Sample GPU Applications + +## CUDA VectorAdd + +In the first example, let's run a simple CUDA sample, which adds two vectors together: + +1. Create a file, such as `cuda-vectoradd.yaml`, with contents like the following: + + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: cuda-vectoradd + spec: + restartPolicy: OnFailure + containers: + - name: cuda-vectoradd + image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04" + resources: + limits: + nvidia.com/gpu: 1 + ``` + +1. Run the pod: + + ```console + $ kubectl apply -f cuda-vectoradd.yaml + ``` + + The pod starts, runs the `vectorAdd` command, and then exits. + +1. View the logs from the container: + + ```console + $ kubectl logs pod/cuda-vectoradd + ``` + + *Example Output* + + ```output + [Vector addition of 50000 elements] + Copy input data from the host memory to the CUDA device + CUDA kernel launch with 196 blocks of 256 threads + Copy output data from the CUDA device to the host memory + Test PASSED + Done + ``` + +1. 
Remove the stopped pod: + + ```console + $ kubectl delete -f cuda-vectoradd.yaml + ``` + + *Example Output* + + ```output + pod "cuda-vectoradd" deleted + ``` + +## Jupyter Notebook + +You can perform the following steps to deploy Jupyter Notebook in your cluster: + +1. Create a file, such as `tf-notebook.yaml`, with contents like the following example: + + ```yaml + --- + apiVersion: v1 + kind: Service + metadata: + name: tf-notebook + labels: + app: tf-notebook + spec: + type: NodePort + ports: + - port: 80 + name: http + targetPort: 8888 + nodePort: 30001 + selector: + app: tf-notebook + --- + apiVersion: v1 + kind: Pod + metadata: + name: tf-notebook + labels: + app: tf-notebook + spec: + securityContext: + fsGroup: 0 + containers: + - name: tf-notebook + image: tensorflow/tensorflow:latest-gpu-jupyter + resources: + limits: + nvidia.com/gpu: 1 + ports: + - containerPort: 8888 + name: notebook + ``` + +1. Apply the manifest to deploy the pod and start the service: + + ```console + $ kubectl apply -f tf-notebook.yaml + ``` + +1. Check the pod status: + + ```console + $ kubectl get pod tf-notebook + ``` + + *Example Output* + + ```output + NAMESPACE NAME READY STATUS RESTARTS AGE + default tf-notebook 1/1 Running 0 3m45s + ``` + +1. Because the manifest includes a service, get the external port for the notebook: + + ```console + $ kubectl get svc tf-notebook + ``` + + *Example Output* + + ```output + NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE + tf-notebook NodePort 10.106.229.20 80:30001/TCP 4m41s + ``` + +1. 
Get the token for the Jupyter notebook: + + ```console + $ kubectl logs tf-notebook + ``` + + *Example Output* + + ```output + [I 21:50:23.188 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret + [I 21:50:23.390 NotebookApp] Serving notebooks from local directory: /tf + [I 21:50:23.391 NotebookApp] The Jupyter Notebook is running at: + [I 21:50:23.391 NotebookApp] http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9 + [I 21:50:23.391 NotebookApp] or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9 + [I 21:50:23.391 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). + [C 21:50:23.394 NotebookApp] + + To access the notebook, open this file in a browser: + file:///root/.local/share/jupyter/runtime/nbserver-1-open.html + Or copy and paste one of these URLs: + http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9 + or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9 + ``` + +The notebook should now be accessible from your browser at this URL: +[http://your-machine-ip:30001/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9](http://your-machine-ip:30001/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9). diff --git a/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md index 42bbf8e41..b345bb896 100644 --- a/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-kata-containers/SKILL.md @@ -20,6 +20,10 @@ tags: # Deploy with Kata Containers +Configure the NVIDIA GPU Operator to run sandboxed GPU workloads with +[Kata Containers](https://katacontainers.io/), which run pods inside lightweight +VMs for stronger workload isolation via GPU passthrough. 
+ ## Prerequisites - A running Kubernetes cluster with NVIDIA GPU worker nodes, and the `kubectl` and `helm` CLIs available. @@ -27,557 +31,35 @@ tags: - Hosts configured to support IOMMU. Check with `ls /sys/kernel/iommu_groups`; if the host is not configured, add the `intel_iommu=on` (or `amd_iommu=on` for AMD CPUs) kernel command-line argument. - For Kubernetes versions older than v1.34, the `KubeletPodResourcesGet` feature gate must be explicitly enabled. -## About the Operator with Kata Containers - -[Kata Containers](https://katacontainers.io/) is an open source project that creates lightweight Virtual Machines (VMs) that feel and perform like traditional containers such as a Docker container. -A traditional container packages software for user-space isolation from the host, -but the container runs on the host and shares the operating system kernel with the host. -Sharing the operating system kernel is a potential vulnerability. - -A Kata container runs in a virtual machine on the host. -The virtual machine has a separate operating system and operating system kernel. -Hardware virtualization and a separate kernel provide improved workload isolation -in comparison with traditional containers. - -The NVIDIA GPU Operator works with the Kata container runtime. -Kata uses a hypervisor, such as QEMU, to provide a lightweight virtual machine with a single purpose: to run a Kubernetes pod. - -The following diagram shows the software components that Kubernetes uses to run a Kata container. - -```mermaid -flowchart LR - a[Kubelet] --> b[CRI] --> c[Kata\nRuntime] --> d[Lightweight\nQEMU VM] --> e[Lightweight\nGuest OS] --> f[Pod] --> g[Container] -``` - -> [!TIP] -> This page describes deploying with Kata containers only. -> Refer to the Confidential Containers documentation if you are interested in deploying Confidential Containers with Kata Containers and the GPU Operator. 
- -## Benefits of Using Kata Containers - -The primary benefits of Kata Containers are as follows: - -* Running untrusted workloads in a container. - The virtual machine provides a layer of defense against the untrusted code. - -* Limiting access to hardware devices such as NVIDIA GPUs. - The virtual machine is provided access to specific devices. - This approach ensures that the workload cannot access additional devices. - -* Transparent deployment of unmodified containers. - -## Limitations and Restrictions - -* For GPU passthrough workloads, all GPUs must be assigned to one Kata Container virtual machine. - Configuring only some GPUs on a node for Kata Containers is not supported. - vGPU is not supported. - -* Support for Kata Containers is limited to the implementation described on this page. - The Operator offers Technology Preview support for Red Hat OpenShift Sandboxed Containers v1.12. - -* NVIDIA supports the Operator and Kata Containers with the containerd runtime only. - -## Cluster Topology Considerations - -You can configure all the worker nodes in your cluster for Kata Containers or you can configure some nodes for Kata Containers and others for traditional containers. -Consider the following example where node A is configured to run traditional containers and node B is configured to run Kata Containers. - -| Node A - Traditional Container nodes receive the following software components | Node B - Kata Container nodes receive the following software components | -| --- | --- | -| * `NVIDIA Driver Manager for Kubernetes` -- to install the data-center driver. * `NVIDIA Container Toolkit` -- to ensure that containers can access GPUs. * `NVIDIA Device Plugin for Kubernetes` -- to discover and advertise GPU resources to kubelet. * `NVIDIA DCGM and DCGM Exporter` -- to monitor GPUs. * `NVIDIA MIG Manager for Kubernetes` -- to manage MIG-capable GPUs. * `Node Feature Discovery` -- to detect CPU, kernel, and host features and label worker nodes. 
* `NVIDIA GPU Feature Discovery` -- to detect NVIDIA GPUs and label worker nodes. | * `NVIDIA Confidential Computing Manager for Kubernetes` -- to set the confidential computing (CC) mode on the NVIDIA GPUs. This component is deployed to all nodes configured for Kata Containers, even if you are not planning to run Confidential Containers. Refer to the Confidential Containers documentation for more details. * `NVIDIA Sandbox Device Plugin` -- to discover and advertise the passthrough GPUs to kubelet. * `NVIDIA VFIO Manager` -- to bind NVIDIA GPUs and NVIDIA NVSwitches to the vfio-pci driver for VFIO passthrough. * `Node Feature Discovery` -- to detect CPU security features, NVIDIA GPUs, and label worker nodes. | -This configuration can be controlled through node labelling, as described in the Label Nodes section. -You can also set `sandboxWorkloads.defaultWorkload=vm-passthrough` when you install the GPU Operator to configure all nodes to run Kata Containers by default. - -## Configure the GPU Operator for Kata Containers - -To enable Kata Containers for GPUs on your cluster, you do the following: - -1. Make sure your cluster meets the prerequisites. -1. Label the nodes you want to use for Kata Containers. -1. Install the upstream `kata-deploy` Helm chart, which deploys all Kata runtime classes, including NVIDIA-specific runtime classes. - The `kata-qemu-nvidia-gpu` runtime class is used with Kata Containers. -1. Install the NVIDIA GPU Operator with Kata sandbox mode enabled. - -After installation, you can run a sample workload that uses the Kata runtime class. - -### Prerequisites - -#### Hardware and BIOS - -* Ensure hosts are configured to enable hardware virtualization and Access Control Services (ACS). - With some AMD CPUs and BIOSes, ACS might be grouped under Advanced Error Reporting (AER). - Enabling these features is typically performed by configuring the host BIOS. - -* Configure hosts to support IOMMU. 
- You can check if your host is configured for IOMMU by running the following command: - - ```console - $ ls /sys/kernel/iommu_groups - ``` - - If the output of this command includes 0, 1, and so on, then your host is configured for IOMMU. - - If the host is not configured or if you are unsure, add the `intel_iommu=on` (or `amd_iommu=on` for AMD CPUs) Linux kernel command-line argument. - For most Linux distributions, add the argument to the `/etc/default/grub` file: - - ```text - ... - GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on modprobe.blacklist=nouveau" - ... - ``` - - On Ubuntu systems, run `sudo update-grub` after making the change to configure the bootloader. - On other systems, you might need to run `sudo dracut` after making the change. - Refer to the documentation for your operating system. - Reboot the host after configuring the bootloader. - - > [!NOTE] - > After configuring IOMMU, you might see QEMU warnings about PCI P2P DMA when running GPU workloads. - > These are expected and can be safely ignored. - > * Ensure that no NVIDIA GPU drivers are installed on the host. - > Kata Containers uses VFIO to pass GPUs directly to the VM, and host-level GPU drivers interfere with VFIO device binding. - - To check if NVIDIA GPU drivers are installed, run the following command: - - ```console - $ lsmod | grep nvidia - ``` - - If the output is empty, no NVIDIA GPU drivers are loaded. - If modules such as `nvidia`, `nvidia_uvm`, or `nvidia_modeset` are listed, NVIDIA GPU drivers are present and must be removed before proceeding. - Refer to [Removing the Driver](https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/removing-the-driver.html) in the NVIDIA Driver Installation Guide. - -#### Kubernetes Cluster - -* A Kubernetes cluster with cluster administrator privileges. - -* Helm installed on your cluster. 
- Use the command below to install Helm or refer to the [Helm documentation](https://helm.sh/docs/intro/install/) for installation instructions. - - ```console - $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \ - && chmod 700 get_helm.sh \ - && ./get_helm.sh - ``` - -* Enable the `KubeletPodResourcesGet` Kubelet feature gate on your cluster. - The Kata runtime uses this feature gate to query the Kubelet Pod Resources API and discover allocated GPU devices during sandbox creation. - - * For Kubernetes v1.34 and later, the `KubeletPodResourcesGet` feature gate is enabled by default. - - * For Kubernetes versions older than v1.34, you must explicitly enable the `KubeletPodResourcesGet` feature gate. - Add the feature gate to your Kubelet configuration (typically `/var/lib/kubelet/config.yaml`): - - ```yaml - apiVersion: kubelet.config.k8s.io/v1beta1 - kind: KubeletConfiguration - featureGates: - KubeletPodResourcesGet: true - ``` - - If your `config.yaml` already has a `featureGates` section, add the gate to the existing section rather than creating a duplicate. - - Restart the Kubelet service to apply the changes: - - ```console - $ sudo systemctl restart kubelet - ``` - - Refer to the [Kata Containers documentation](https://github.com/kata-containers/kata-containers/blob/main/docs/use-cases/NVIDIA-GPU-passthrough-and-Kata-QEMU.md#kata-runtime) for more details on the Kata runtime and VFIO cold-plug. - -### Label Nodes to use Kata Containers - -1. Get a list of the nodes in your cluster: - - ```console - $ kubectl get nodes - ``` - - *Example Output:* - - ```output - NAME STATUS ROLES AGE VERSION - node-01 Ready 10d v1.34.0 - node-02 Ready 10d v1.34.0 - ``` - -1. Label the nodes you want to use for Kata Containers: - - ```console - $ kubectl label node nvidia.com/gpu.workload.config=vm-passthrough - ``` - - The GPU Operator uses this label to determine what software components to deploy to a node. 
- The `nvidia.com/gpu.workload.config=vm-passthrough` label specifies that the node should receive the software components to run Kata Containers. - A node can only run one container runtime at a time, so a labeled node runs only Kata container workloads and cannot run traditional GPU container workloads. - The labeling approach is useful if you want to run Kata container workloads on some nodes and traditional GPU container workloads on other nodes in your cluster. - Refer to the GPU Operator Cluster Topology Considerations section for more details on what gets deployed to a Kata Container node. - - > [!TIP] - > Skip this section if you plan to set `sandboxWorkloads.defaultWorkload=vm-passthrough` when you install the GPU Operator. - - 1. Verify the node label was added: - - ```console - $ kubectl describe node | grep nvidia.com/gpu.workload.config - ``` - - *Example Output:* - - ```output - nvidia.com/gpu.workload.config: vm-passthrough - ``` - -After labeling the nodes, you can continue to the next steps to install Kata Containers and the NVIDIA GPU Operator. - -### Install the Kata Containers Helm Chart - -Install Kata Containers using the `kata-deploy` Helm chart. -The `kata-deploy` chart installs all required components from the Kata Containers project including the Kata Containers runtime binary, runtime configuration, UVM kernel, and images that NVIDIA uses for Kata Containers. - -The minimum required version is 3.29.0. - -1. Set the chart version and registry path: - - ```console - $ export VERSION="3.29.0" - $ export CHART="oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy" - ``` - -1. 
Install the kata-deploy Helm chart: - - ```console - $ helm install kata-deploy "${CHART}" \ - --namespace kata-system --create-namespace \ - --set nfd.enabled=false \ - --wait --timeout 10m \ - --version "${VERSION}" - ``` - - *Example Output:* - - ```output - LAST DEPLOYED: Wed Apr 1 17:03:00 2026 - NAMESPACE: kata-system - STATUS: deployed - REVISION: 1 - DESCRIPTION: Install complete - TEST SUITE: None - ``` - - > [!NOTE] - > The `--wait` flag in the install command instructs Helm to wait until the release is deployed before returning. - > It can take a few minutes to return output. - - There is a [known Helm issue](https://github.com/helm/helm/issues/8660) on single node clusters, that may result in the Helm command finishing before all deployed pods are finished initializing. - If you are deploying to a single node cluster, you may need to wait for an additional few minutes after the Helm command completes for the `kata-deploy` pod to be in the Running state. - > [!NOTE] - > Both `kata-deploy` and the GPU Operator deploy Node Feature Discovery (NFD) by default. - > The install command includes `--set nfd.enabled=false` to prevent `kata-deploy` from deploying NFD. - > The GPU Operator will deploy and manage NFD in the next step. - - 1. Optional: Verify that the `kata-deploy` pod is running: - - ```console - $ kubectl get pods -n kata-system | grep kata-deploy - ``` - - *Example Output:* - - ```output - NAME READY STATUS RESTARTS AGE - kata-deploy-b2lzs 1/1 Running 0 6m37s - ``` - -1. Optional: Verify that the `kata-qemu-nvidia-gpu` runtime class is available: - - ```console - $ kubectl get runtimeclass | grep kata-qemu-nvidia-gpu - ``` - - *Example Output:* - - ```output - NAME HANDLER AGE - kata-qemu-nvidia-gpu kata-qemu-nvidia-gpu 40s - kata-qemu-nvidia-gpu-snp kata-qemu-nvidia-gpu-snp 40s - kata-qemu-nvidia-gpu-tdx kata-qemu-nvidia-gpu-tdx 40s - ``` - - Several runtime classes are installed by the `kata-deploy` chart. 
- The `kata-qemu-nvidia-gpu` runtime class is used with Kata Containers. - The `kata-qemu-nvidia-gpu-snp` and `kata-qemu-nvidia-gpu-tdx` runtime classes are used to deploy Confidential Containers. - - > [!NOTE] - > To manage the lifecycle of Kata Containers, including upgrades and day-two operations, - > install the [Kata Lifecycle Manager](https://github.com/kata-containers/lifecycle-manager). - > This Argo Workflows-based tool is the recommended way to manage Kata Containers deployments. - - 1. Optional: If you have an issue deploying the `kata-deploy` pod or are not seeing the expected runtime classes, get the pod name and view the logs: - - ```console - $ kubectl get pods -n kata-system | grep kata-deploy - $ kubectl logs -n kata-system - ``` - - Replace `` with the name of the `kata-deploy` pod from the first command's output. - -### Install the NVIDIA GPU Operator - -Install the NVIDIA GPU Operator and configure it to deploy Kata Container components. - -1. Add and update the NVIDIA Helm repository: - - ```console - $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ - && helm repo update - ``` - - *Example Output:* - - ```output - "nvidia" has been added to your repositories - Hang tight while we grab the latest from your chart repositories... - ...Successfully got an update from the "nvidia" chart repository - Update Complete. ⎈Happy Helming!⎈ - ``` - -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). - -1. Install the GPU Operator. - The following configures the GPU Operator to deploy the operands that are required for Kata Containers. - Refer to Common Chart Customization Options for more details on the additional configuration options you can specify when installing the GPU Operator. 
- - ```console - $ helm install --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set sandboxWorkloads.enabled=true \ - --set sandboxWorkloads.mode=kata \ - --set nfd.enabled=true \ - --set nfd.nodefeaturerules=true - ``` - - *Example Output:* - - ```output - NAME: gpu-operator - LAST DEPLOYED: Wed Mar 25 17:21:34 2026 - NAMESPACE: gpu-operator - STATUS: deployed - REVISION: 1 - DESCRIPTION: Install complete - TEST SUITE: None - ``` - - > [!TIP] - > Add `--set sandboxWorkloads.defaultWorkload=vm-passthrough` if every worker node should use Kata by default. - - 1. Optional: Verify that all GPU Operator pods, especially the Sandbox Device Plugin and VFIO Manager operands, are running: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - *Example Output:* - - ```output - NAME READY STATUS RESTARTS AGE - gpu-operator-1766001809-node-feature-discovery-gc-75776475sxzkp 1/1 Running 0 86s - gpu-operator-1766001809-node-feature-discovery-master-6869lxq2g 1/1 Running 0 86s - gpu-operator-1766001809-node-feature-discovery-worker-mh4cv 1/1 Running 0 86s - gpu-operator-f48fd66b-vtfrl 1/1 Running 0 86s - nvidia-cc-manager-7z74t 1/1 Running 0 61s - nvidia-kata-sandbox-device-plugin-daemonset-d5rvg 1/1 Running 0 30s - nvidia-sandbox-validator-6xnzc 1/1 Running 0 30s - nvidia-vfio-manager-h229x 1/1 Running 0 62s - ``` - - > [!NOTE] - > It can take several minutes for all GPU Operator pods to be in the Running state. - > If you are not seeing the expected output, you can view the logs for the GPU Operator pods: - - ```console - $ kubectl logs -n gpu-operator - ``` - - Replace `` with the name of the GPU Operator pod from `kubectl get pods -n gpu-operator`. - > [!NOTE] - > The NVIDIA Confidential Computing (CC) Manager for Kubernetes (`nvidia-cc-manager`) is deployed to all nodes configured to run Kata containers, even if you are not planning to run Confidential Containers. 
- > This manager sets the confidential computing mode on the NVIDIA GPUs, if your GPU is capable of Confidential Computing, but will not be used if you are deploying in Kata Containers only. - > Refer to Confidential Containers for more details. - - 1. Optional: If you have host access to the worker node, you can perform the following validation step: - - a. Confirm that the host uses the `vfio-pci` device driver for GPUs: - - ```console - $ lspci -nnk -d 10de: - ``` - - *Example Output:* - - ```output - 65:00.0 3D controller [0302]: NVIDIA Corporation xxxxxxx [xxx] [10de:xxxx] (rev xx) - Subsystem: NVIDIA Corporation xxxxxxx [xxx] [10de:xxxx] - Kernel driver in use: vfio-pci - Kernel modules: nvidiafb, nouveau - ``` - -### Optional: Configuring GPU or NVSwitch Resource Types Name - -By default, the NVIDIA GPU Operator creates a resource type for GPUs and NVSwitches, `nvidia.com/pgpu` and `nvidia.com/nvswitch`. -You can reference these names in your manifests to request GPU or NVSwitch resources for your workload. -If you want to use a different name, you can set the `P_GPU_ALIAS` or `NVSWITCH_ALIAS` environment variables in the Kata device plugin to your preferred name. -In clusters where all GPUs are the same model, a single resource type is typically sufficient. - -In heterogeneous clusters, where you have different GPU types on your nodes, you might want to use specific GPU types for your workload. -To do this, specify an empty `P_GPU_ALIAS` environment variable in the Kata device plugin by adding the following to your GPU Operator installation: -`--set kataSandboxDevicePlugin.env[0].name=P_GPU_ALIAS` and -`--set kataSandboxDevicePlugin.env[0].value=""`. - -When this variable is set to `""`, the Kata device plugin creates GPU model-specific resource types, for example `nvidia.com/GH100_H100L_94GB`, instead of the default `nvidia.com/pgpu` type. -Use the exposed device resource types in pod specs by specifying respective resource limits. 
- -Similarly, you can set `NVSWITCH_ALIAS` to `""` to advertise model-specific NVSwitch resource types. - -The following example installs the GPU Operator with both `P_GPU_ALIAS` and `NVSWITCH_ALIAS` configured: - -```console -$ helm install --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set sandboxWorkloads.enabled=true \ - --set sandboxWorkloads.mode=kata \ - --set nfd.enabled=true \ - --set nfd.nodefeaturerules=true \ - --set kataSandboxDevicePlugin.env[0].name=P_GPU_ALIAS \ - --set kataSandboxDevicePlugin.env[0].value="" \ - --set kataSandboxDevicePlugin.env[1].name=NVSWITCH_ALIAS \ - --set kataSandboxDevicePlugin.env[1].value="" -``` - -After installing the GPU Operator, you can view the GPU or NVSwitch resource types available on a node by running the following command: - -```console -$ kubectl get node -o json | grep nvidia.com -``` - -*Example Output:* - -```output -"nvidia.com/GH100_H100L_94GB": "1" -``` - -## Run a Sample Workload - -A pod specification for a Kata container requires the following: - -* Specify a Kata runtime class. - -* Specify a passthrough GPU resource. - -1. Create a file, such as `cuda-vectoradd-kata.yaml`, with the following content: - - ```yaml - apiVersion: v1 - kind: Pod - metadata: - name: cuda-vectoradd-kata - namespace: default - spec: - runtimeClassName: kata-qemu-nvidia-gpu - restartPolicy: OnFailure - containers: - - name: cuda-vectoradd - image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04" - resources: - limits: - nvidia.com/pgpu: "1" - memory: 16Gi - ``` - -1. Create the pod: - - ```console - $ kubectl apply -f cuda-vectoradd-kata.yaml - ``` - - *Example Output:* - - ```output - pod/cuda-vectoradd-kata created - ``` - -1. Optional: Verify the pod is running: - - ```console - $ kubectl get pod cuda-vectoradd-kata - ``` - - *Example Output:* - - ```output - NAME READY STATUS RESTARTS AGE - cuda-vectoradd-kata 1/1 Running 0 10s - ``` - -1. 
View the pod logs: - - ```console - $ kubectl logs -n default cuda-vectoradd-kata - ``` - - *Example Output:* - - ```output - [Vector addition of 50000 elements] - Copy input data from the host memory to the CUDA device - CUDA kernel launch with 196 blocks of 256 threads - Copy output data from the CUDA device to the host memory - Test PASSED - Done - ``` - -1. Delete the pod: - - ```console - $ kubectl delete -f cuda-vectoradd-kata.yaml - ``` - -### Troubleshooting Workloads +> Full prerequisite detail (hardware/BIOS, IOMMU, driver removal, Helm install, feature-gate config) is in [references/prerequisites.md](references/prerequisites.md). -If the sample workload does not run, confirm that you labeled nodes to run virtual machines in containers: +## Activation -```console -$ kubectl get nodes -l nvidia.com/gpu.workload.config=vm-passthrough -``` +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All command sequences, manifest contents, and verification output +live only in those reference files — do not improvise commands from this +dispatch layer. -*Example Output:* +## Phases -```output -NAME STATUS ROLES AGE VERSION -kata-worker-1 Ready 10d v1.35.3 -kata-worker-2 Ready 10d v1.35.3 -kata-worker-3 Ready 10d v1.35.3 -``` +| Phase | Summary | Reference | +|-------|---------|-----------| +| Concepts | What Kata Containers are, benefits, limitations/restrictions, cluster topology (per-node component split), and the high-level configuration flow. | [references/concepts.md](references/concepts.md) | +| Prerequisites | Detailed hardware/BIOS, IOMMU configuration, host driver removal, Helm install, and the `KubeletPodResourcesGet` feature gate. 
| [references/prerequisites.md](references/prerequisites.md) | +| Install | Label nodes for Kata, install the upstream `kata-deploy` Helm chart, install the GPU Operator in Kata sandbox mode, and optionally configure GPU/NVSwitch resource type names. | [references/install.md](references/install.md) | +| Workload | Run a sample GPU workload with the `kata-qemu-nvidia-gpu` runtime class, verify it, and troubleshoot. | [references/workload.md](references/workload.md) | -You might have configured `vm-passthrough` as the default sandbox workload in the ClusterPolicy resource. -That setting applies the default sandbox workload cluster-wide, including for Kata when `mode` is `kata`. -Also confirm in the ClusterPolicy that `sandboxWorkloads` is configured for Kata as shown in the following example. +## Hard rules (apply across all phases) -```console -$ kubectl describe clusterpolicy | grep sandboxWorkloads -``` +- For GPU passthrough, all GPUs on a node must be assigned to one Kata VM; configuring only some GPUs per node is not supported. vGPU is not supported. +- NVIDIA supports the Operator and Kata Containers with the containerd runtime only. +- A labeled Kata node runs only Kata workloads — it cannot also run traditional GPU container workloads. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. -*Example Output:* +## Verification -```output -sandboxWorkloads: - enabled: true - defaultWorkload: vm-passthrough - mode: kata -``` +After install, confirm the Sandbox Device Plugin and VFIO Manager pods are +`Running`, then run the sample workload and confirm `Test PASSED`. Exact +commands and expected output are in [references/install.md](references/install.md) +and [references/workload.md](references/workload.md). 
diff --git a/gpu-operator/.agents/skills/gpu-operator-kata-containers/references/concepts.md b/gpu-operator/.agents/skills/gpu-operator-kata-containers/references/concepts.md new file mode 100644 index 000000000..4c6c12fd0 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-kata-containers/references/concepts.md @@ -0,0 +1,78 @@ + + + +# Kata Containers with the GPU Operator: Concepts + +## About the Operator with Kata Containers + +[Kata Containers](https://katacontainers.io/) is an open source project that creates lightweight Virtual Machines (VMs) that feel and perform like traditional containers such as a Docker container. +A traditional container packages software for user-space isolation from the host, +but the container runs on the host and shares the operating system kernel with the host. +Sharing the operating system kernel is a potential vulnerability. + +A Kata container runs in a virtual machine on the host. +The virtual machine has a separate operating system and operating system kernel. +Hardware virtualization and a separate kernel provide improved workload isolation +in comparison with traditional containers. + +The NVIDIA GPU Operator works with the Kata container runtime. +Kata uses a hypervisor, such as QEMU, to provide a lightweight virtual machine with a single purpose: to run a Kubernetes pod. + +The following diagram shows the software components that Kubernetes uses to run a Kata container. + +```mermaid +flowchart LR + a[Kubelet] --> b[CRI] --> c[Kata\nRuntime] --> d[Lightweight\nQEMU VM] --> e[Lightweight\nGuest OS] --> f[Pod] --> g[Container] +``` + +> [!TIP] +> This page describes deploying with Kata containers only. +> Refer to the Confidential Containers documentation if you are interested in deploying Confidential Containers with Kata Containers and the GPU Operator. + +## Benefits of Using Kata Containers + +The primary benefits of Kata Containers are as follows: + +* Running untrusted workloads in a container. 
+ The virtual machine provides a layer of defense against the untrusted code. + +* Limiting access to hardware devices such as NVIDIA GPUs. + The virtual machine is provided access to specific devices. + This approach ensures that the workload cannot access additional devices. + +* Transparent deployment of unmodified containers. + +## Limitations and Restrictions + +* For GPU passthrough workloads, all GPUs must be assigned to one Kata Container virtual machine. + Configuring only some GPUs on a node for Kata Containers is not supported. + vGPU is not supported. + +* Support for Kata Containers is limited to the implementation described on this page. + The Operator offers Technology Preview support for Red Hat OpenShift Sandboxed Containers v1.12. + +* NVIDIA supports the Operator and Kata Containers with the containerd runtime only. + +## Cluster Topology Considerations + +You can configure all the worker nodes in your cluster for Kata Containers or you can configure some nodes for Kata Containers and others for traditional containers. +Consider the following example where node A is configured to run traditional containers and node B is configured to run Kata Containers. + +| Node A - Traditional Container nodes receive the following software components | Node B - Kata Container nodes receive the following software components | +| --- | --- | +| * `NVIDIA Driver Manager for Kubernetes` -- to install the data-center driver. * `NVIDIA Container Toolkit` -- to ensure that containers can access GPUs. * `NVIDIA Device Plugin for Kubernetes` -- to discover and advertise GPU resources to kubelet. * `NVIDIA DCGM and DCGM Exporter` -- to monitor GPUs. * `NVIDIA MIG Manager for Kubernetes` -- to manage MIG-capable GPUs. * `Node Feature Discovery` -- to detect CPU, kernel, and host features and label worker nodes. * `NVIDIA GPU Feature Discovery` -- to detect NVIDIA GPUs and label worker nodes. 
| * `NVIDIA Confidential Computing Manager for Kubernetes` -- to set the confidential computing (CC) mode on the NVIDIA GPUs. This component is deployed to all nodes configured for Kata Containers, even if you are not planning to run Confidential Containers. Refer to the Confidential Containers documentation for more details. * `NVIDIA Sandbox Device Plugin` -- to discover and advertise the passthrough GPUs to kubelet. * `NVIDIA VFIO Manager` -- to bind NVIDIA GPUs and NVIDIA NVSwitches to the vfio-pci driver for VFIO passthrough. * `Node Feature Discovery` -- to detect CPU security features, NVIDIA GPUs, and label worker nodes. | + +This configuration can be controlled through node labelling, as described in the Label Nodes section. +You can also set `sandboxWorkloads.defaultWorkload=vm-passthrough` when you install the GPU Operator to configure all nodes to run Kata Containers by default. + +## Overview of the configuration flow + +To enable Kata Containers for GPUs on your cluster, you do the following: + +1. Make sure your cluster meets the prerequisites (see [references/prerequisites.md](prerequisites.md)). +1. Label the nodes you want to use for Kata Containers (see [references/install.md](install.md)). +1. Install the upstream `kata-deploy` Helm chart, which deploys all Kata runtime classes, including NVIDIA-specific runtime classes. + The `kata-qemu-nvidia-gpu` runtime class is used with Kata Containers. +1. Install the NVIDIA GPU Operator with Kata sandbox mode enabled. + +After installation, you can run a sample workload that uses the Kata runtime class (see [references/workload.md](workload.md)). 
diff --git a/gpu-operator/.agents/skills/gpu-operator-kata-containers/references/install.md b/gpu-operator/.agents/skills/gpu-operator-kata-containers/references/install.md new file mode 100644 index 000000000..e202a1f12 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-kata-containers/references/install.md @@ -0,0 +1,291 @@ + + + +# Configure and Install Kata Containers with the GPU Operator + +## Label Nodes to use Kata Containers + +1. Get a list of the nodes in your cluster: + + ```console + $ kubectl get nodes + ``` + + *Example Output:* + + ```output + NAME STATUS ROLES AGE VERSION + node-01 Ready 10d v1.34.0 + node-02 Ready 10d v1.34.0 + ``` + +1. Label the nodes you want to use for Kata Containers: + + ```console + $ kubectl label node nvidia.com/gpu.workload.config=vm-passthrough + ``` + + The GPU Operator uses this label to determine what software components to deploy to a node. + The `nvidia.com/gpu.workload.config=vm-passthrough` label specifies that the node should receive the software components to run Kata Containers. + A node can only run one container runtime at a time, so a labeled node runs only Kata container workloads and cannot run traditional GPU container workloads. + The labeling approach is useful if you want to run Kata container workloads on some nodes and traditional GPU container workloads on other nodes in your cluster. + Refer to the GPU Operator Cluster Topology Considerations section for more details on what gets deployed to a Kata Container node. + + > [!TIP] + > Skip this section if you plan to set `sandboxWorkloads.defaultWorkload=vm-passthrough` when you install the GPU Operator. + + 1. Verify the node label was added: + + ```console + $ kubectl describe node | grep nvidia.com/gpu.workload.config + ``` + + *Example Output:* + + ```output + nvidia.com/gpu.workload.config: vm-passthrough + ``` + +After labeling the nodes, you can continue to the next steps to install Kata Containers and the NVIDIA GPU Operator. 
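
When several nodes need the label, the label and verify steps above can be wrapped in a small script. The following is a minimal sketch under stated assumptions, not something shipped with the GPU Operator: `label_kata_nodes`, `list_kata_nodes`, and the `KUBECTL` override are hypothetical names introduced here for illustration, and `kubectl` is assumed to be configured for the target cluster.

```shell
# Hypothetical helpers (not part of kubectl or the GPU Operator).
# KUBECTL can be overridden, for example with `echo`, to preview the calls.
KUBECTL="${KUBECTL:-kubectl}"

label_kata_nodes() {
  # Apply the vm-passthrough workload label to every node name passed in.
  for node in "$@"; do
    # --overwrite replaces an existing value of the label instead of failing.
    $KUBECTL label node "$node" \
      nvidia.com/gpu.workload.config=vm-passthrough --overwrite || return 1
  done
}

list_kata_nodes() {
  # List the nodes that currently carry the Kata workload label.
  $KUBECTL get nodes -l nvidia.com/gpu.workload.config=vm-passthrough
}
```

For example, `KUBECTL=echo label_kata_nodes node-01 node-02` prints the arguments that would be passed to `kubectl` without contacting the cluster; dropping the override applies the label, and `list_kata_nodes` then shows which nodes carry it.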
+ +## Install the Kata Containers Helm Chart + +Install Kata Containers using the `kata-deploy` Helm chart. +The `kata-deploy` chart installs all required components from the Kata Containers project including the Kata Containers runtime binary, runtime configuration, UVM kernel, and images that NVIDIA uses for Kata Containers. + +The minimum required version is 3.29.0. + +1. Set the chart version and registry path: + + ```console + $ export VERSION="3.29.0" + $ export CHART="oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy" + ``` + +1. Install the kata-deploy Helm chart: + + ```console + $ helm install kata-deploy "${CHART}" \ + --namespace kata-system --create-namespace \ + --set nfd.enabled=false \ + --wait --timeout 10m \ + --version "${VERSION}" + ``` + + *Example Output:* + + ```output + LAST DEPLOYED: Wed Apr 1 17:03:00 2026 + NAMESPACE: kata-system + STATUS: deployed + REVISION: 1 + DESCRIPTION: Install complete + TEST SUITE: None + ``` + + > [!NOTE] + > The `--wait` flag in the install command instructs Helm to wait until the release is deployed before returning. + > It can take a few minutes to return output. + + There is a [known Helm issue](https://github.com/helm/helm/issues/8660) on single node clusters, that may result in the Helm command finishing before all deployed pods are finished initializing. + If you are deploying to a single node cluster, you may need to wait for an additional few minutes after the Helm command completes for the `kata-deploy` pod to be in the Running state. + > [!NOTE] + > Both `kata-deploy` and the GPU Operator deploy Node Feature Discovery (NFD) by default. + > The install command includes `--set nfd.enabled=false` to prevent `kata-deploy` from deploying NFD. + > The GPU Operator will deploy and manage NFD in the next step. + + 1. 
Optional: Verify that the `kata-deploy` pod is running: + + ```console + $ kubectl get pods -n kata-system | grep kata-deploy + ``` + + *Example Output:* + + ```output + NAME READY STATUS RESTARTS AGE + kata-deploy-b2lzs 1/1 Running 0 6m37s + ``` + +1. Optional: Verify that the `kata-qemu-nvidia-gpu` runtime class is available: + + ```console + $ kubectl get runtimeclass | grep kata-qemu-nvidia-gpu + ``` + + *Example Output:* + + ```output + NAME HANDLER AGE + kata-qemu-nvidia-gpu kata-qemu-nvidia-gpu 40s + kata-qemu-nvidia-gpu-snp kata-qemu-nvidia-gpu-snp 40s + kata-qemu-nvidia-gpu-tdx kata-qemu-nvidia-gpu-tdx 40s + ``` + + Several runtime classes are installed by the `kata-deploy` chart. + The `kata-qemu-nvidia-gpu` runtime class is used with Kata Containers. + The `kata-qemu-nvidia-gpu-snp` and `kata-qemu-nvidia-gpu-tdx` runtime classes are used to deploy Confidential Containers. + + > [!NOTE] + > To manage the lifecycle of Kata Containers, including upgrades and day-two operations, + > install the [Kata Lifecycle Manager](https://github.com/kata-containers/lifecycle-manager). + > This Argo Workflows-based tool is the recommended way to manage Kata Containers deployments. + + 1. Optional: If you have an issue deploying the `kata-deploy` pod or are not seeing the expected runtime classes, get the pod name and view the logs: + + ```console + $ kubectl get pods -n kata-system | grep kata-deploy + $ kubectl logs -n kata-system + ``` + + Replace `` with the name of the `kata-deploy` pod from the first command's output. + +## Install the NVIDIA GPU Operator + +Install the NVIDIA GPU Operator and configure it to deploy Kata Container components. + +1. Add and update the NVIDIA Helm repository: + + ```console + $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ + && helm repo update + ``` + + *Example Output:* + + ```output + "nvidia" has been added to your repositories + Hang tight while we grab the latest from your chart repositories... 
+ ...Successfully got an update from the "nvidia" chart repository + Update Complete. ⎈Happy Helming!⎈ + ``` + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +1. Install the GPU Operator. + The following configures the GPU Operator to deploy the operands that are required for Kata Containers. + Refer to Common Chart Customization Options for more details on the additional configuration options you can specify when installing the GPU Operator. + + ```console + $ helm install --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set sandboxWorkloads.enabled=true \ + --set sandboxWorkloads.mode=kata \ + --set nfd.enabled=true \ + --set nfd.nodefeaturerules=true + ``` + + *Example Output:* + + ```output + NAME: gpu-operator + LAST DEPLOYED: Wed Mar 25 17:21:34 2026 + NAMESPACE: gpu-operator + STATUS: deployed + REVISION: 1 + DESCRIPTION: Install complete + TEST SUITE: None + ``` + + > [!TIP] + > Add `--set sandboxWorkloads.defaultWorkload=vm-passthrough` if every worker node should use Kata by default. + + 1. 
Optional: Verify that all GPU Operator pods, especially the Sandbox Device Plugin and VFIO Manager operands, are running: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + *Example Output:* + + ```output + NAME READY STATUS RESTARTS AGE + gpu-operator-1766001809-node-feature-discovery-gc-75776475sxzkp 1/1 Running 0 86s + gpu-operator-1766001809-node-feature-discovery-master-6869lxq2g 1/1 Running 0 86s + gpu-operator-1766001809-node-feature-discovery-worker-mh4cv 1/1 Running 0 86s + gpu-operator-f48fd66b-vtfrl 1/1 Running 0 86s + nvidia-cc-manager-7z74t 1/1 Running 0 61s + nvidia-kata-sandbox-device-plugin-daemonset-d5rvg 1/1 Running 0 30s + nvidia-sandbox-validator-6xnzc 1/1 Running 0 30s + nvidia-vfio-manager-h229x 1/1 Running 0 62s + ``` + + > [!NOTE] + > It can take several minutes for all GPU Operator pods to be in the Running state. + > If you are not seeing the expected output, you can view the logs for the GPU Operator pods: + + ```console + $ kubectl logs -n gpu-operator + ``` + + Replace `` with the name of the GPU Operator pod from `kubectl get pods -n gpu-operator`. + > [!NOTE] + > The NVIDIA Confidential Computing (CC) Manager for Kubernetes (`nvidia-cc-manager`) is deployed to all nodes configured to run Kata containers, even if you are not planning to run Confidential Containers. + > This manager sets the confidential computing mode on the NVIDIA GPUs, if your GPU is capable of Confidential Computing, but will not be used if you are deploying in Kata Containers only. + > Refer to Confidential Containers for more details. + + 1. Optional: If you have host access to the worker node, you can perform the following validation step: + + a. 
Confirm that the host uses the `vfio-pci` device driver for GPUs: + + ```console + $ lspci -nnk -d 10de: + ``` + + *Example Output:* + + ```output + 65:00.0 3D controller [0302]: NVIDIA Corporation xxxxxxx [xxx] [10de:xxxx] (rev xx) + Subsystem: NVIDIA Corporation xxxxxxx [xxx] [10de:xxxx] + Kernel driver in use: vfio-pci + Kernel modules: nvidiafb, nouveau + ``` + +## Optional: Configuring GPU or NVSwitch Resource Types Name + +By default, the NVIDIA GPU Operator creates a resource type for GPUs and NVSwitches, `nvidia.com/pgpu` and `nvidia.com/nvswitch`. +You can reference these names in your manifests to request GPU or NVSwitch resources for your workload. +If you want to use a different name, you can set the `P_GPU_ALIAS` or `NVSWITCH_ALIAS` environment variables in the Kata device plugin to your preferred name. +In clusters where all GPUs are the same model, a single resource type is typically sufficient. + +In heterogeneous clusters, where you have different GPU types on your nodes, you might want to use specific GPU types for your workload. +To do this, specify an empty `P_GPU_ALIAS` environment variable in the Kata device plugin by adding the following to your GPU Operator installation: +`--set kataSandboxDevicePlugin.env[0].name=P_GPU_ALIAS` and +`--set kataSandboxDevicePlugin.env[0].value=""`. + +When this variable is set to `""`, the Kata device plugin creates GPU model-specific resource types, for example `nvidia.com/GH100_H100L_94GB`, instead of the default `nvidia.com/pgpu` type. +Use the exposed device resource types in pod specs by specifying respective resource limits. + +Similarly, you can set `NVSWITCH_ALIAS` to `""` to advertise model-specific NVSwitch resource types. 
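+
+As a minimal sketch of the pod-spec usage described above, a pod could request a model-specific resource type as follows. The resource name `nvidia.com/GH100_H100L_94GB` is the example name from this section; the pod name is hypothetical, and the runtime class and image are assumptions that should be replaced with the values available in your cluster.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: cuda-vectoradd-kata-h100          # hypothetical pod name
+spec:
+  runtimeClassName: kata-qemu-nvidia-gpu  # assumed Kata runtime class
+  restartPolicy: OnFailure
+  containers:
+    - name: cuda-vectoradd
+      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04"  # assumed sample image
+      resources:
+        limits:
+          # Model-specific type advertised when P_GPU_ALIAS is set to "";
+          # replaces the default nvidia.com/pgpu request.
+          nvidia.com/GH100_H100L_94GB: "1"
+```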
+ +The following example installs the GPU Operator with both `P_GPU_ALIAS` and `NVSWITCH_ALIAS` configured: + +```console +$ helm install --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set sandboxWorkloads.enabled=true \ + --set sandboxWorkloads.mode=kata \ + --set nfd.enabled=true \ + --set nfd.nodefeaturerules=true \ + --set kataSandboxDevicePlugin.env[0].name=P_GPU_ALIAS \ + --set kataSandboxDevicePlugin.env[0].value="" \ + --set kataSandboxDevicePlugin.env[1].name=NVSWITCH_ALIAS \ + --set kataSandboxDevicePlugin.env[1].value="" +``` + +After installing the GPU Operator, you can view the GPU or NVSwitch resource types available on a node by running the following command: + +```console +$ kubectl get node -o json | grep nvidia.com +``` + +*Example Output:* + +```output +"nvidia.com/GH100_H100L_94GB": "1" +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-kata-containers/references/prerequisites.md b/gpu-operator/.agents/skills/gpu-operator-kata-containers/references/prerequisites.md new file mode 100644 index 000000000..9fb0d1be7 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-kata-containers/references/prerequisites.md @@ -0,0 +1,87 @@ + + + +# Kata Containers Detailed Prerequisites + +## Hardware and BIOS + +* Ensure hosts are configured to enable hardware virtualization and Access Control Services (ACS). + With some AMD CPUs and BIOSes, ACS might be grouped under Advanced Error Reporting (AER). + Enabling these features is typically performed by configuring the host BIOS. + +* Configure hosts to support IOMMU. + You can check if your host is configured for IOMMU by running the following command: + + ```console + $ ls /sys/kernel/iommu_groups + ``` + + If the output of this command includes 0, 1, and so on, then your host is configured for IOMMU. 
+ + If the host is not configured or if you are unsure, add the `intel_iommu=on` (or `amd_iommu=on` for AMD CPUs) Linux kernel command-line argument. + For most Linux distributions, add the argument to the `/etc/default/grub` file: + + ```text + ... + GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on modprobe.blacklist=nouveau" + ... + ``` + + On Ubuntu systems, run `sudo update-grub` after making the change to configure the bootloader. + On other systems, you might need to run `sudo dracut` after making the change. + Refer to the documentation for your operating system. + Reboot the host after configuring the bootloader. + + > [!NOTE] + > After configuring IOMMU, you might see QEMU warnings about PCI P2P DMA when running GPU workloads. + > These are expected and can be safely ignored. + +* Ensure that no NVIDIA GPU drivers are installed on the host. + Kata Containers uses VFIO to pass GPUs directly to the VM, and host-level GPU drivers interfere with VFIO device binding. + + To check if NVIDIA GPU drivers are installed, run the following command: + + ```console + $ lsmod | grep nvidia + ``` + + If the output is empty, no NVIDIA GPU drivers are loaded. + If modules such as `nvidia`, `nvidia_uvm`, or `nvidia_modeset` are listed, NVIDIA GPU drivers are present and must be removed before proceeding. + Refer to [Removing the Driver](https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/removing-the-driver.html) in the NVIDIA Driver Installation Guide. + +## Kubernetes Cluster + +* A Kubernetes cluster with cluster administrator privileges. + +* Helm installed on your cluster. + Use the command below to install Helm or refer to the [Helm documentation](https://helm.sh/docs/intro/install/) for installation instructions. 
+ + ```console + $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \ + && chmod 700 get_helm.sh \ + && ./get_helm.sh + ``` + +* Enable the `KubeletPodResourcesGet` Kubelet feature gate on your cluster. + The Kata runtime uses this feature gate to query the Kubelet Pod Resources API and discover allocated GPU devices during sandbox creation. + + * For Kubernetes v1.34 and later, the `KubeletPodResourcesGet` feature gate is enabled by default. + + * For Kubernetes versions older than v1.34, you must explicitly enable the `KubeletPodResourcesGet` feature gate. + Add the feature gate to your Kubelet configuration (typically `/var/lib/kubelet/config.yaml`): + + ```yaml + apiVersion: kubelet.config.k8s.io/v1beta1 + kind: KubeletConfiguration + featureGates: + KubeletPodResourcesGet: true + ``` + + If your `config.yaml` already has a `featureGates` section, add the gate to the existing section rather than creating a duplicate. + + Restart the Kubelet service to apply the changes: + + ```console + $ sudo systemctl restart kubelet + ``` + + Refer to the [Kata Containers documentation](https://github.com/kata-containers/kata-containers/blob/main/docs/use-cases/NVIDIA-GPU-passthrough-and-Kata-QEMU.md#kata-runtime) for more details on the Kata runtime and VFIO cold-plug. diff --git a/gpu-operator/.agents/skills/gpu-operator-kata-containers/references/workload.md b/gpu-operator/.agents/skills/gpu-operator-kata-containers/references/workload.md new file mode 100644 index 000000000..2ab79e504 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-kata-containers/references/workload.md @@ -0,0 +1,114 @@ + + + +# Run a Sample Kata Workload and Troubleshoot + +## Run a Sample Workload + +A pod specification for a Kata container requires the following: + +* Specify a Kata runtime class. + +* Specify a passthrough GPU resource. + +1. 
Create a file, such as `cuda-vectoradd-kata.yaml`, with the following content: + + ```yaml + apiVersion: v1 + kind: Pod + metadata: + name: cuda-vectoradd-kata + namespace: default + spec: + runtimeClassName: kata-qemu-nvidia-gpu + restartPolicy: OnFailure + containers: + - name: cuda-vectoradd + image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04" + resources: + limits: + nvidia.com/pgpu: "1" + memory: 16Gi + ``` + +1. Create the pod: + + ```console + $ kubectl apply -f cuda-vectoradd-kata.yaml + ``` + + *Example Output:* + + ```output + pod/cuda-vectoradd-kata created + ``` + +1. Optional: Verify the pod is running: + + ```console + $ kubectl get pod cuda-vectoradd-kata + ``` + + *Example Output:* + + ```output + NAME READY STATUS RESTARTS AGE + cuda-vectoradd-kata 1/1 Running 0 10s + ``` + +1. View the pod logs: + + ```console + $ kubectl logs -n default cuda-vectoradd-kata + ``` + + *Example Output:* + + ```output + [Vector addition of 50000 elements] + Copy input data from the host memory to the CUDA device + CUDA kernel launch with 196 blocks of 256 threads + Copy output data from the CUDA device to the host memory + Test PASSED + Done + ``` + +1. Delete the pod: + + ```console + $ kubectl delete -f cuda-vectoradd-kata.yaml + ``` + +## Troubleshooting Workloads + +If the sample workload does not run, confirm that you labeled nodes to run virtual machines in containers: + +```console +$ kubectl get nodes -l nvidia.com/gpu.workload.config=vm-passthrough +``` + +*Example Output:* + +```output +NAME STATUS ROLES AGE VERSION +kata-worker-1 Ready 10d v1.35.3 +kata-worker-2 Ready 10d v1.35.3 +kata-worker-3 Ready 10d v1.35.3 +``` + +You might have configured `vm-passthrough` as the default sandbox workload in the ClusterPolicy resource. +That setting applies the default sandbox workload cluster-wide, including for Kata when `mode` is `kata`. 
+Also confirm in the ClusterPolicy that `sandboxWorkloads` is configured for Kata as shown in the following example. + +```console +$ kubectl describe clusterpolicy | grep sandboxWorkloads +``` + +*Example Output:* + +```output +sandboxWorkloads: + enabled: true + defaultWorkload: vm-passthrough + mode: kata +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md index d20aff45a..8a8c743b8 100644 --- a/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-kubevirt/SKILL.md @@ -20,6 +20,10 @@ tags: # GPU Operator with KubeVirt +Provision worker nodes for GPU-accelerated virtual machines with KubeVirt using +the GPU Operator, supporting both GPU passthrough and NVIDIA vGPU workloads +alongside container workloads in the same cluster. + ## Prerequisites Before using KubeVirt with the GPU Operator, ensure the following prerequisites are configured on your cluster and nodes: @@ -28,488 +32,37 @@ Before using KubeVirt with the GPU Operator, ensure the following prerequisites - The host is booted with `intel_iommu=on` or `amd_iommu=on` on the kernel command line. - If planning to use NVIDIA vGPU, SR-IOV must be enabled in the BIOS if your GPUs are based on the NVIDIA Ampere architecture or later. Refer to the [NVIDIA vGPU Documentation](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#prereqs-vgpu) to ensure you have met all the prerequisites for using NVIDIA vGPU. - KubeVirt is installed in the cluster. 
-- Starting with KubeVirt v0.58.2 and v0.59.1, set the `DisableMDEVConfiguration` feature gate: - - ```console - $ kubectl patch kubevirt -n kubevirt kubevirt --type='json' \ - -p='[{"op": "add", "path": "/spec/configuration/developerConfiguration/featureGates/-", "value": "DisableMDEVConfiguration" }]' - ``` - -## About the Operator with KubeVirt - -[KubeVirt](https://kubevirt.io/) is a virtual machine management add-on to Kubernetes that allows you to run and manage virtual machines in a Kubernetes cluster. -It eliminates the need to manage separate clusters for virtual machine and container workloads because both can now coexist in a single Kubernetes cluster. - -In addition to the GPU Operator being able to provision worker nodes for running GPU-accelerated containers, the GPU Operator can also be used to provision worker nodes for running GPU-accelerated virtual machines with KubeVirt. - -There are some different prerequisites required when running virtual machines with GPUs compared to running containers with GPUs. -The primary difference is the drivers required. -For example, the datacenter driver is needed for containers, the vfio-pci driver is needed for GPU passthrough, and the [NVIDIA vGPU Manager](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#installing-configuring-grid-vgpu) is needed for creating vGPU devices. - -### Configure Worker Nodes for GPU Operator components - -The GPU Operator can now be configured to deploy different software components on worker nodes depending on what GPU workload is configured to run on those nodes. -This is configured by adding a `nvidia.com/gpu.workload.config` label to the worker node with the value of `container`, `vm-passthrough`, or `vm-vgpu` depending on if you are planning to use vGPU or not. -The GPU Operator will use the label to determine which software components to deploy on the worker nodes. 
- -Given the following node configuration: - -* Node A is configured with the label `nvidia.com/gpu.workload.config=container` and configured to run containers. -* Node B is configured with the label `nvidia.com/gpu.workload.config=vm-passthrough` and configured to run virtual machines with Passthrough GPU. -* Node C is configured with the label `nvidia.com/gpu.workload.config=vm-vgpu` and configured to run virtual machines with vGPU. - -The GPU Operator will deploy the following software components on each node: - -* Node A receives the following software components: - * `NVIDIA Datacenter Driver` - to install the driver - * `NVIDIA Container Toolkit` - to ensure containers can properly access GPUs - * `NVIDIA Kubernetes Device Plugin` - to discover and advertise GPU resources to kubelet - * `NVIDIA DCGM and DCGM Exporter` - to monitor the GPU(s) - -* Node B receives the following software components: - * `VFIO Manager` - to load `vfio-pci` and bind it to all GPUs on the node - * `Sandbox Device Plugin` - to discover and advertise the passthrough GPUs to kubelet - -* Node C receives the following software components: - * `NVIDIA vGPU Manager` - to install the driver - * `NVIDIA vGPU Device Manager` - to create vGPU devices on the node - * `Sandbox Device Plugin` - to discover and advertise the vGPU devices to kubelet - -If the node label `nvidia.com/gpu.workload.config` does not exist on the node, the GPU Operator will assume the default GPU workload configuration, `container`, and will deploy the software components needed to support this workload type. -To override the default GPU workload configuration, set the following value in `ClusterPolicy`: `sandboxWorkloads.defaultWorkload=`. - -### Assumptions, constraints, and dependencies - -* A GPU worker node can run GPU workloads of a particular type, such as containers, virtual machines with GPU Passthrough, or virtual machines with vGPU, but not a combination of any of them. 
- -* The cluster admin or developer has knowledge about their cluster ahead of time and can properly label nodes to indicate what types of GPU workloads they will run. - -* Worker nodes running GPU accelerated virtual machines (with GPU passthrough or vGPU) are assumed to be bare metal. - -* The GPU Operator will not automate the installation of NVIDIA drivers inside KubeVirt virtual machines with GPUs/vGPUs attached. - -* Users must manually add all passthrough GPU and vGPU resources to the `permittedDevices` list in the KubeVirt CR before assigning them to KubeVirt virtual machines. Refer to the [KubeVirt documentation](https://kubevirt.io/user-guide/compute/host-devices/#listing-permitted-devices) for more information. - -## Configure KubeVirt with the GPU Operator - -After configuring the prerequisites, the high level workflow for using the GPU Operator with KubeVirt is as follows: - -* Label worker nodes based on the GPU workloads they will run. -* Install the GPU Operator and set `sandboxWorkloads.enabled=true` - -If you are planning to deploy VMs with vGPU, the workflow is as follows: - -* Build the NVIDIA vGPU Manager image -* Label the node for the vGPU configuration -* Add vGPU resources to KubeVirt CR -* Create a virtual machine with vGPU - -If you are planning to deploy VMs with GPU passthrough, the workflow is as follows: - -* Add GPU passthrough resources to KubeVirt CR -* Create a virtual machine with GPU passthrough - -### Label worker nodes - -The GPU Operator uses the value of the `nvidia.com/gpu.workload.config` label to determine which operands to deploy on your worker node. - -1. 
Add a `nvidia.com/gpu.workload.config` label to a worker node: - - ```console - $ kubectl label node --overwrite nvidia.com/gpu.workload.config=vm-vgpu - ``` - - You can assign the following values to the label: - - * `container` - * `vm-passthrough` - * `vm-vgpu` - - Refer to the Configure Worker Nodes for GPU Operator components section for more information on the different configurations options. - -### Install the GPU Operator - -Follow one of the below subsections for installing the GPU Operator, depending on whether you plan to use NVIDIA vGPU or not. - -> [!NOTE] -> The following commands set the `sandboxWorkloads.enabled` flag. -> This `ClusterPolicy` flag controls whether the GPU Operator can provision GPU worker nodes for virtual machine workloads, in addition to container workloads. -> This flag is disabled by default, meaning all nodes get provisioned with the same software to enable container workloads, and the `nvidia.com/gpu.workload.config` node label is not used. - -The term *sandboxing* refers to running software in a separate isolated environment, typically for added security (that is, a virtual machine). -We use the term `sandbox workloads` to signify workloads that run in a virtual machine, irrespective of the virtualization technology used. -#### Install the GPU Operator without NVIDIA vGPU - -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). - -Install the GPU Operator, enabling `sandboxWorkloads`: - -```console -$ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set sandboxWorkloads.enabled=true -``` - -#### Install the GPU Operator with NVIDIA vGPU - -Before installing the GPU Operator with NVIDIA vGPU, you must build a private NVIDIA vGPU Manager container image and push to a private registry. -Follow the steps provided in this section. - -1. 
Create a namespace for GPU Operator: - - ```console - $ kubectl create namespace gpu-operator - ``` - -1. Create an ImagePullSecret for accessing the NVIDIA vGPU Manager image: - - ```console - $ kubectl create secret docker-registry ${REGISTRY_SECRET_NAME} \ - --docker-server=${PRIVATE_REGISTRY} --docker-username= \ - --docker-password= \ - --docker-email= -n gpu-operator - ``` - -1. Install the GPU Operator with `sandboxWorkloads` and `vgpuManager` enabled and specify the NVIDIA vGPU Manager image built previously: - - ```console - $ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set sandboxWorkloads.enabled=true \ - --set vgpuManager.enabled=true \ - --set vgpuManager.repository= \ - --set vgpuManager.image=vgpu-manager \ - --set vgpuManager.version= \ - --set vgpuManager.imagePullSecrets={${REGISTRY_SECRET_NAME}} - ``` - -The vGPU Device Manager, deployed by the GPU Operator, automatically creates vGPU devices that can be assigned to KubeVirt virtual machines. -Without additional configuration, the GPU Operator creates a default set of devices on all GPUs. -To learn more about the vGPU Device Manager and configure which types of vGPU devices get created in your cluster, refer to vGPU Device Configuration. - -### Add GPU resources to KubeVirt CR -Follow one of the below subsections for adding GPU resources to the KubeVirt CR, depending on whether you plan to use NVIDIA vGPU or not. - -#### Add vGPU resources to KubeVirt CR - -Update the KubeVirt custom resource so that all vGPU devices in your cluster are permitted and can be assigned to virtual machines. - -The following example shows how to permit the A10-12Q vGPU device, the device names for the GPUs on your cluster will likely be different. - -1. 
Determine the resource names for the GPU devices: - - ```console - $ kubectl get node cnt-server-2 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))' - ``` - - *Example Output* - - ```output - { - "nvidia.com/NVIDIA_A10-12Q": "4" - } - ``` - -1. Determine the PCI device IDs for the GPUs. - - * You can search by device name in the [PCI IDs database](https://pci-ids.ucw.cz/v2.2/pci.ids). - - * If you have host access to the node, you can list the NVIDIA GPU devices with a command like the following example: - - ```console - $ lspci -nnk -d 10de: - ``` - - *Example Output* - - ```output - 65:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1) - Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482] - Kernel modules: nvidiafb, nouveau - ``` - -1. Modify the `KubeVirt` custom resource like the following partial example. - - ```yaml - ... - spec: - configuration: - developerConfiguration: - featureGates: - - GPU - - DisableMDEVConfiguration - permittedHostDevices: # Defines VM devices to import. - mediatedDevices: # Include for vGPU - - externalResourceProvider: true - mdevNameSelector: NVIDIA A10-12Q - resourceName: nvidia.com/NVIDIA_A10-12Q - ... - ``` - - Replace the values in the YAML as follows: - - * `mdevNameSelector` and `resourceName` under `mediatedDevices` to correspond to your vGPU type. - - * Set `externalResourceProvider=true` to indicate that this resource is provided by an external device plugin, in this case the `sandbox-device-plugin` that is deployed by the GPU Operator. - -Refer to the [KubeVirt user guide](https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices) for more information on the configuration options. - -#### Add GPU passthrough resources to KubeVirt CR - -Update the KubeVirt custom resource so that all GPU passthrough devices in your cluster are permitted and can be assigned to virtual machines. 
- -The following example shows how to permit the A10 GPU device, the device names for the GPUs on your cluster will likely be different. - -1. Determine the resource names for the GPU devices: - - ```console - $ kubectl get node cnt-server-2 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))' - ``` - - *Example Output* - - ```output - { - "nvidia.com/GA102GL_A10": "1" - } - ``` - -1. Determine the PCI device IDs for the GPUs. - - * You can search by device name in the [PCI IDs database](https://pci-ids.ucw.cz/v2.2/pci.ids). - - * If you have host access to the node, you can list the NVIDIA GPU devices with a command like the following example: - - ```console - $ lspci -nnk -d 10de: - ``` - - *Example Output* - - ```output - 65:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1) - Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482] - Kernel modules: nvidiafb, nouveau - ``` - -1. Modify the `KubeVirt` custom resource like the following partial example. - - ```yaml - ... - spec: - configuration: - developerConfiguration: - featureGates: - - GPU - - DisableMDEVConfiguration - permittedHostDevices: # Defines VM devices to import. - pciHostDevices: # Include for GPU passthrough - - externalResourceProvider: true - pciVendorSelector: 10DE:2236 - resourceName: nvidia.com/GA102GL_A10 - ... - ``` - - Replace the values in the YAML as follows: - - * `pciVendorSelector` and `resourceName` under `pciHostDevices` to correspond to your GPU model. - - * Set `externalResourceProvider=true` to indicate that this resource is provided by an external device plugin, in this case the `sandbox-device-plugin` that is deployed by the GPU Operator. - -Refer to the [KubeVirt user guide](https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices) for more information on the configuration options. 
- -### Create a virtual machine with GPU - -After the `sandbox-device-plugin` pod is running on your worker nodes and the GPU resources have been added to the -KubeVirt allowlist, you can assign a GPU to a virtual machine by editing the `spec.domain.devices.gpus` field -in the `VirtualMachineInstance` manifest. - -Example for GPU passthrough: - -```yaml -apiVersion: kubevirt.io/v1alpha3 -kind: VirtualMachineInstance -... -spec: - domain: - devices: - gpus: - - deviceName: nvidia.com/GA102GL_A10 - name: gpu1 -... -``` - -Example for vGPU: - -```yaml -apiVersion: kubevirt.io/v1alpha3 -kind: VirtualMachineInstance -... -spec: - domain: - devices: - gpus: - - deviceName: nvidia.com/NVIDIA_A10-12Q - name: gpu1 -... -``` - -* `deviceName` is the resource name representing the device. - -* `name` is a name to identify the device in the virtual machine - -## vGPU Device Configuration - -The vGPU Device Manager assists in creating vGPU devices on GPU worker nodes. -The vGPU Device Manager allows administrators to declaratively define a set of possible vGPU device configurations they would like applied to GPUs on a node. -At runtime, adminstrators then point the vGPU Device Manager at one of these configurations, and vGPU Device Manager takes care of applying it. - -The configuration file is created as a ConfigMap, and is shared across all worker nodes. -At runtime, a node label, `nvidia.com/vgpu.config`, can be used to decide which of these configurations to actually apply to a node at any given time. -If the node is not labeled, then the `default` configuration will be used. -For more information on this component and how it is configured, refer to the [NVIDIA vGPU Device Manager README](https://github.com/NVIDIA/vgpu-device-manager). 
- -By default, the GPU Operator deploys a ConfigMap for the vGPU Device Manager, containing named configurations for all [vGPU types supported by NVIDIA vGPU](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#supported-gpus-grid-vgpu). -Users can select a specific configuration for a worker node by applying the `nvidia.com/vgpu.config` node label. -For example, labeling a node with `nvidia.com/vgpu.config=A10-8Q` would create three vGPU devices of type **A10-8Q** on all **A10** GPUs on the node. Note that three is the maximum number of **A10-8Q** devices that can be created per GPU. -If the node is not labeled, the `default` configuration will be applied. -The `default` configuration will create Q-series vGPU devices on all GPUs, where the amount of framebuffer memory per vGPU device is half the total GPU memory. -For example, the `default` configuration will create two **A10-12Q** devices on all **A10** GPUs. - -You can also create different vGPU Q profiles on the same GPU using vGPU Device Manager configuration. -For example, you can create a **A10-4Q** and a **A10-6Q** device on same GPU by creating a vGPU Device Manager configuration with the following content: - -```yaml -version: v1 -vgpu-configs: - custom-A10-config: - - devices: all - vgpu-devices: - "A10-4Q": 3 - "A10-6Q": 2 -``` - -If custom vGPU device configuration is desired, more than the default config map provides, you can create your own config map: - -```console -$ kubectl create configmap custom-vgpu-config -n gpu-operator --from-file=config.yaml=/path/to/file -``` - -And then configure the GPU Operator to use it by setting `vgpuDeviceManager.config.name=custom-vgpu-config`. - -### Apply a New vGPU Device Configuration - -You can apply a specific vGPU device configuration on a per-node basis by setting the `nvidia.com/vgpu.config` node label. -It is recommended to set this node label prior to installing the GPU Operator if you do not want the default configuration applied. 
- -Switching vGPU device configuration after one has been successfully applied assumes that no virtual machines with vGPU are currently running on the node. -Any existing virtual machines should be shutdown/migrated before you apply the new configuration. - -To apply a new configuration after GPU Operator install, update the `nvidia.com/vgpu.config` node label. - -> [!NOTE] -> On GPUs that support MIG, you have the option to select MIG-backed vGPU instances instead of time-sliced vGPU instances. -> To select a MIG-backed vGPU profile, label the node with the name of the MIG-backed vGPU profile. -> The following example shows how to apply a new configuration on a system with two **A10** GPUs. - -```console -$ nvidia-smi -L -GPU 0: NVIDIA A10 (UUID: GPU-ebd34bdf-1083-eaac-2aff-4b71a022f9bd) -GPU 1: NVIDIA A10 (UUID: GPU-1795e88b-3395-b27b-dad8-0488474eec0c) -``` - -In this example, the GPU Operator has been installed and the `nvidia.com/vgpu.config` was not added to worker nodes, meaning the `default` vGPU config got applied. -This resulted in the creation of four **A10-12Q** devices (two per GPU): - -```console -$ kubectl get node cnt-server-2 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))' -{ - "nvidia.com/NVIDIA_A10-12Q": "4" -} -``` - -Now if you wanted to create **A10-4Q** devices, add the `nvidia.com/vgpu.config` label to the node: - -```console -$ kubectl label node --overwrite nvidia.com/vgpu.config=A10-4Q -``` - -After the vGPU Device Manager finishes applying the new configuration, all GPU Operator pods should return to the Running state. - -```console -$ kubectl get pods -n gpu-operator -NAME READY STATUS RESTARTS AGE -... 
-nvidia-sandbox-device-plugin-daemonset-brtb6 1/1 Running 0 10s -nvidia-sandbox-validator-ljnwg 1/1 Running 0 10s -nvidia-vgpu-device-manager-8mgg8 1/1 Running 0 30m -nvidia-vgpu-manager-daemonset-fpplc 1/1 Running 0 31m -``` - -You can now see 12 **A10-4Q** devices on the node, as six **A10-4Q** devices can be created per **A10** GPU. - -```console -$ kubectl get node cnt-server-2 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))' -{ - "nvidia.com/NVIDIA_A10-4Q": "12" -} -``` - -## Building the NVIDIA vGPU Manager image - -> [!NOTE] -> Building the NVIDIA vGPU Manager image is only required if you are planning to use NVIDIA vGPU. -> If only planning to use PCI passthrough, skip this section. -> This section covers building the NVIDIA vGPU Manager container image and pushing it to a private registry. - -Download the vGPU Software from the [NVIDIA Licensing Portal](https://stg.ui.licensing.nvidia.com/). - -* Login to the NVIDIA Licensing Portal and navigate to the **Software Downloads** section. -* The NVIDIA vGPU Software is located in the **Software Downloads** section of the NVIDIA Licensing Portal. -* The vGPU Software bundle is packaged as a zip file. Download and unzip the bundle to obtain the NVIDIA vGPU Manager for Linux file, `NVIDIA-Linux-x86_64--vgpu-kvm.run`. - -Next, clone the driver container repository and build the driver image with the following steps. - -Open a terminal and clone the driver container image repository. - -```console -$ git clone https://github.com/NVIDIA/gpu-driver-container.git -$ cd gpu-driver-container -``` - -1. Copy the NVIDIA vGPU manager from your extracted ZIP file to the operating system version you want to build the image for: - * We use Ubuntu 22.04 as an example. 
+- Starting with KubeVirt v0.58.2 and v0.59.1, set the `DisableMDEVConfiguration` feature gate (the exact `kubectl patch` command is in [references/configure-and-install.md](references/configure-and-install.md)). - Copy `/\*-vgpu-kvm.run` to `vgpu-manager/ubuntu22.04/`. +## Activation - ```console - $ cp /*-vgpu-kvm.run vgpu-manager/ubuntu22.04/ - ``` +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All Helm/kubectl command sequences, KubeVirt CR manifests, VM +manifests, and image-build steps live only in those reference files — do not +improvise commands from this dispatch layer. -> [!NOTE] -> For Red Hat OpenShift, use a directory that includes `rhel` in the directory name. For example, `vgpu-manager/rhel8`. -| Set the following environment variables: -| `PRIVATE_REGISTRY` - name of private registry used to store driver image -| `VGPU_HOST_DRIVER_VERSION` - NVIDIA vGPU Manager version downloaded from NVIDIA Software Portal -| `OS_TAG` - this must match the Guest OS version. In the following example `ubuntu22.04` is used. For Red Hat OpenShift this should be set to `rhcos4.x` where x is the supported minor OCP version. +## Phases -```console -$ export PRIVATE_REGISTRY=my/private/registry VGPU_HOST_DRIVER_VERSION=580.82.07 OS_TAG=ubuntu22.04 -``` +| Phase | Summary | Reference | +|-------|---------|-----------| +| Concepts | What KubeVirt is, per-node component split by `nvidia.com/gpu.workload.config` (`container`/`vm-passthrough`/`vm-vgpu`), assumptions/constraints, and the high-level workflow. | [references/concepts.md](references/concepts.md) | +| Configure & install | Label worker nodes, install the GPU Operator with `sandboxWorkloads.enabled` (with or without vGPU), add vGPU or GPU-passthrough resources to the KubeVirt CR, and create a VM with a GPU. 
| [references/configure-and-install.md](references/configure-and-install.md) | +| vGPU device config | Use the vGPU Device Manager ConfigMap and the `nvidia.com/vgpu.config` node label to declaratively create and switch vGPU device profiles. | [references/vgpu-device-config.md](references/vgpu-device-config.md) | +| Build vGPU Manager image | Download the vGPU software, clone the driver-container repo, and build/push the private NVIDIA vGPU Manager image (required only for vGPU). | [references/build-vgpu-manager.md](references/build-vgpu-manager.md) | -Build the NVIDIA vGPU Manager image. +## Hard rules (apply across all phases) -```console -$ VGPU_HOST_DRIVER_VERSION=${VGPU_HOST_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/vgpu-manager make build-vgpuhost-${OS_TAG} -``` +- A GPU worker node runs exactly one workload type (`container`, `vm-passthrough`, or `vm-vgpu`) — never a combination. +- You must manually add all passthrough/vGPU resources to the KubeVirt CR `permittedHostDevices` list before assigning them to VMs. +- The GPU Operator does NOT install NVIDIA drivers inside the guest VMs; that is the user's responsibility. +- Building the vGPU Manager image is required only for vGPU; skip it for PCI passthrough only. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. -Push NVIDIA vGPU Manager image to your private registry. +## Verification -```console -$ VGPU_HOST_DRIVER_VERSION=${VGPU_HOST_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/vgpu-manager make push-vgpuhost-${OS_TAG} -``` +After install and CR configuration, confirm the `sandbox-device-plugin` pod is +`Running` and the expected `nvidia.com/...` GPU/vGPU resources appear in the +node's allocatable resources before assigning them to a VM. 
Exact commands are +in [references/configure-and-install.md](references/configure-and-install.md) +and [references/vgpu-device-config.md](references/vgpu-device-config.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-kubevirt/references/build-vgpu-manager.md b/gpu-operator/.agents/skills/gpu-operator-kubevirt/references/build-vgpu-manager.md new file mode 100644 index 000000000..2d2211df2 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-kubevirt/references/build-vgpu-manager.md @@ -0,0 +1,60 @@ + + + +# Building the NVIDIA vGPU Manager image + +> [!NOTE] +> Building the NVIDIA vGPU Manager image is only required if you are planning to use NVIDIA vGPU. +> If only planning to use PCI passthrough, skip this section. +> This section covers building the NVIDIA vGPU Manager container image and pushing it to a private registry. + +Download the vGPU Software from the [NVIDIA Licensing Portal](https://stg.ui.licensing.nvidia.com/). + +* Login to the NVIDIA Licensing Portal and navigate to the **Software Downloads** section. +* The NVIDIA vGPU Software is located in the **Software Downloads** section of the NVIDIA Licensing Portal. +* The vGPU Software bundle is packaged as a zip file. Download and unzip the bundle to obtain the NVIDIA vGPU Manager for Linux file, `NVIDIA-Linux-x86_64--vgpu-kvm.run`. + +Next, clone the driver container repository and build the driver image with the following steps. + +Open a terminal and clone the driver container image repository. + +```console +$ git clone https://github.com/NVIDIA/gpu-driver-container.git +$ cd gpu-driver-container +``` + +1. Copy the NVIDIA vGPU manager from your extracted ZIP file to the operating system version you want to build the image for: + * We use Ubuntu 22.04 as an example. + + Copy `/\*-vgpu-kvm.run` to `vgpu-manager/ubuntu22.04/`. 
+ + ```console + $ cp /*-vgpu-kvm.run vgpu-manager/ubuntu22.04/ + ``` + +> [!NOTE] +> For Red Hat OpenShift, use a directory that includes `rhel` in the directory name. For example, `vgpu-manager/rhel8`. + +Set the following environment variables: + +| Variable | Description | +| --- | --- | +| `PRIVATE_REGISTRY` | name of private registry used to store driver image | +| `VGPU_HOST_DRIVER_VERSION` | NVIDIA vGPU Manager version downloaded from NVIDIA Software Portal | +| `OS_TAG` | this must match the Guest OS version. In the following example `ubuntu22.04` is used. For Red Hat OpenShift this should be set to `rhcos4.x` where x is the supported minor OCP version. | + +```console +$ export PRIVATE_REGISTRY=my/private/registry VGPU_HOST_DRIVER_VERSION=580.82.07 OS_TAG=ubuntu22.04 +``` + +Build the NVIDIA vGPU Manager image. + +```console +$ VGPU_HOST_DRIVER_VERSION=${VGPU_HOST_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/vgpu-manager make build-vgpuhost-${OS_TAG} +``` + +Push NVIDIA vGPU Manager image to your private registry. + +```console +$ VGPU_HOST_DRIVER_VERSION=${VGPU_HOST_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/vgpu-manager make push-vgpuhost-${OS_TAG} +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-kubevirt/references/concepts.md b/gpu-operator/.agents/skills/gpu-operator-kubevirt/references/concepts.md new file mode 100644 index 000000000..5f30f0312 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-kubevirt/references/concepts.md @@ -0,0 +1,81 @@ + + + +# KubeVirt with the GPU Operator: Concepts + +## About the Operator with KubeVirt + +[KubeVirt](https://kubevirt.io/) is a virtual machine management add-on to Kubernetes that allows you to run and manage virtual machines in a Kubernetes cluster. +It eliminates the need to manage separate clusters for virtual machine and container workloads because both can now coexist in a single Kubernetes cluster. 
+ +In addition to the GPU Operator being able to provision worker nodes for running GPU-accelerated containers, the GPU Operator can also be used to provision worker nodes for running GPU-accelerated virtual machines with KubeVirt. + +There are some different prerequisites required when running virtual machines with GPUs compared to running containers with GPUs. +The primary difference is the drivers required. +For example, the datacenter driver is needed for containers, the vfio-pci driver is needed for GPU passthrough, and the [NVIDIA vGPU Manager](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#installing-configuring-grid-vgpu) is needed for creating vGPU devices. + +## Configure Worker Nodes for GPU Operator components + +The GPU Operator can now be configured to deploy different software components on worker nodes depending on what GPU workload is configured to run on those nodes. +This is configured by adding a `nvidia.com/gpu.workload.config` label to the worker node with the value of `container`, `vm-passthrough`, or `vm-vgpu` depending on if you are planning to use vGPU or not. +The GPU Operator will use the label to determine which software components to deploy on the worker nodes. + +Given the following node configuration: + +* Node A is configured with the label `nvidia.com/gpu.workload.config=container` and configured to run containers. +* Node B is configured with the label `nvidia.com/gpu.workload.config=vm-passthrough` and configured to run virtual machines with Passthrough GPU. +* Node C is configured with the label `nvidia.com/gpu.workload.config=vm-vgpu` and configured to run virtual machines with vGPU. 
+ +The GPU Operator will deploy the following software components on each node: + +* Node A receives the following software components: + * `NVIDIA Datacenter Driver` - to install the driver + * `NVIDIA Container Toolkit` - to ensure containers can properly access GPUs + * `NVIDIA Kubernetes Device Plugin` - to discover and advertise GPU resources to kubelet + * `NVIDIA DCGM and DCGM Exporter` - to monitor the GPU(s) + +* Node B receives the following software components: + * `VFIO Manager` - to load `vfio-pci` and bind it to all GPUs on the node + * `Sandbox Device Plugin` - to discover and advertise the passthrough GPUs to kubelet + +* Node C receives the following software components: + * `NVIDIA vGPU Manager` - to install the driver + * `NVIDIA vGPU Device Manager` - to create vGPU devices on the node + * `Sandbox Device Plugin` - to discover and advertise the vGPU devices to kubelet + +If the node label `nvidia.com/gpu.workload.config` does not exist on the node, the GPU Operator will assume the default GPU workload configuration, `container`, and will deploy the software components needed to support this workload type. +To override the default GPU workload configuration, set the following value in `ClusterPolicy`: `sandboxWorkloads.defaultWorkload=`. + +## Assumptions, constraints, and dependencies + +* A GPU worker node can run GPU workloads of a particular type, such as containers, virtual machines with GPU Passthrough, or virtual machines with vGPU, but not a combination of any of them. + +* The cluster admin or developer has knowledge about their cluster ahead of time and can properly label nodes to indicate what types of GPU workloads they will run. + +* Worker nodes running GPU accelerated virtual machines (with GPU passthrough or vGPU) are assumed to be bare metal. + +* The GPU Operator will not automate the installation of NVIDIA drivers inside KubeVirt virtual machines with GPUs/vGPUs attached. 
+ +* Users must manually add all passthrough GPU and vGPU resources to the `permittedHostDevices` list in the KubeVirt CR before assigning them to KubeVirt virtual machines. Refer to the [KubeVirt documentation](https://kubevirt.io/user-guide/compute/host-devices/#listing-permitted-devices) for more information. + +## Workflow Overview + +After configuring the prerequisites, the high-level workflow for using the GPU Operator with KubeVirt is as follows: + +* Label worker nodes based on the GPU workloads they will run. +* Install the GPU Operator and set `sandboxWorkloads.enabled=true` + +If you are planning to deploy VMs with vGPU, the workflow is as follows: + +* Build the NVIDIA vGPU Manager image (see [references/build-vgpu-manager.md](build-vgpu-manager.md)) +* Label the node for the vGPU configuration +* Add vGPU resources to KubeVirt CR +* Create a virtual machine with vGPU + +If you are planning to deploy VMs with GPU passthrough, the workflow is as follows: + +* Add GPU passthrough resources to KubeVirt CR +* Create a virtual machine with GPU passthrough + +The term *sandboxing* refers to running software in a separate isolated environment, typically for added security (that is, a virtual machine). +We use the term `sandbox workloads` to signify workloads that run in a virtual machine, irrespective of the virtualization technology used. diff --git a/gpu-operator/.agents/skills/gpu-operator-kubevirt/references/configure-and-install.md b/gpu-operator/.agents/skills/gpu-operator-kubevirt/references/configure-and-install.md new file mode 100644 index 000000000..ac10997ad --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-kubevirt/references/configure-and-install.md @@ -0,0 +1,259 @@ + + + +# Configure KubeVirt with the GPU Operator + +Throughout, replace `` with your target GPU Operator release.
+ +## Label worker nodes + +The GPU Operator uses the value of the `nvidia.com/gpu.workload.config` label to determine which operands to deploy on your worker node. + +1. Add a `nvidia.com/gpu.workload.config` label to a worker node: + + ```console + $ kubectl label node --overwrite nvidia.com/gpu.workload.config=vm-vgpu + ``` + + You can assign the following values to the label: + + * `container` + * `vm-passthrough` + * `vm-vgpu` + + Refer to the Configure Worker Nodes for GPU Operator components section in [concepts.md](concepts.md) for more information on the different configuration options. + +## Install the GPU Operator + +Follow one of the following subsections to install the GPU Operator, depending on whether you plan to use NVIDIA vGPU. + +> [!NOTE] +> The following commands set the `sandboxWorkloads.enabled` flag. +> This `ClusterPolicy` flag controls whether the GPU Operator can provision GPU worker nodes for virtual machine workloads, in addition to container workloads. +> This flag is disabled by default, meaning all nodes get provisioned with the same software to enable container workloads, and the `nvidia.com/gpu.workload.config` node label is not used. + +### Install the GPU Operator without NVIDIA vGPU + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +Install the GPU Operator, enabling `sandboxWorkloads`: + +```console +$ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set sandboxWorkloads.enabled=true +``` + +### Install the GPU Operator with NVIDIA vGPU + +Before installing the GPU Operator with NVIDIA vGPU, you must build a private NVIDIA vGPU Manager container image and push it to a private registry (see [references/build-vgpu-manager.md](build-vgpu-manager.md)). +Follow the steps provided in this section. + +1.
Create a namespace for GPU Operator: + + ```console + $ kubectl create namespace gpu-operator + ``` + +1. Create an ImagePullSecret for accessing the NVIDIA vGPU Manager image: + + ```console + $ kubectl create secret docker-registry ${REGISTRY_SECRET_NAME} \ + --docker-server=${PRIVATE_REGISTRY} --docker-username= \ + --docker-password= \ + --docker-email= -n gpu-operator + ``` + +1. Install the GPU Operator with `sandboxWorkloads` and `vgpuManager` enabled and specify the NVIDIA vGPU Manager image built previously: + + ```console + $ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set sandboxWorkloads.enabled=true \ + --set vgpuManager.enabled=true \ + --set vgpuManager.repository= \ + --set vgpuManager.image=vgpu-manager \ + --set vgpuManager.version= \ + --set vgpuManager.imagePullSecrets={${REGISTRY_SECRET_NAME}} + ``` + +The vGPU Device Manager, deployed by the GPU Operator, automatically creates vGPU devices that can be assigned to KubeVirt virtual machines. +Without additional configuration, the GPU Operator creates a default set of devices on all GPUs. +To learn more about the vGPU Device Manager and configure which types of vGPU devices get created in your cluster, refer to vGPU Device Configuration (see [references/vgpu-device-config.md](vgpu-device-config.md)). + +## Add GPU resources to KubeVirt CR + +Follow one of the below subsections for adding GPU resources to the KubeVirt CR, depending on whether you plan to use NVIDIA vGPU or not. + +### Add vGPU resources to KubeVirt CR + +Update the KubeVirt custom resource so that all vGPU devices in your cluster are permitted and can be assigned to virtual machines. + +The following example shows how to permit the A10-12Q vGPU device, the device names for the GPUs on your cluster will likely be different. + +1. 
Determine the resource names for the GPU devices: + + ```console + $ kubectl get node cnt-server-2 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))' + ``` + + *Example Output* + + ```output + { + "nvidia.com/NVIDIA_A10-12Q": "4" + } + ``` + +1. Determine the PCI device IDs for the GPUs. + + * You can search by device name in the [PCI IDs database](https://pci-ids.ucw.cz/v2.2/pci.ids). + + * If you have host access to the node, you can list the NVIDIA GPU devices with a command like the following example: + + ```console + $ lspci -nnk -d 10de: + ``` + + *Example Output* + + ```output + 65:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1) + Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482] + Kernel modules: nvidiafb, nouveau + ``` + +1. Modify the `KubeVirt` custom resource like the following partial example. + + ```yaml + ... + spec: + configuration: + developerConfiguration: + featureGates: + - GPU + - DisableMDEVConfiguration + permittedHostDevices: # Defines VM devices to import. + mediatedDevices: # Include for vGPU + - externalResourceProvider: true + mdevNameSelector: NVIDIA A10-12Q + resourceName: nvidia.com/NVIDIA_A10-12Q + ... + ``` + + Replace the values in the YAML as follows: + + * `mdevNameSelector` and `resourceName` under `mediatedDevices` to correspond to your vGPU type. + + * Set `externalResourceProvider=true` to indicate that this resource is provided by an external device plugin, in this case the `sandbox-device-plugin` that is deployed by the GPU Operator. + +Refer to the [KubeVirt user guide](https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices) for more information on the configuration options. + +### Add GPU passthrough resources to KubeVirt CR + +Update the KubeVirt custom resource so that all GPU passthrough devices in your cluster are permitted and can be assigned to virtual machines. 
+ +The following example shows how to permit the A10 GPU device, the device names for the GPUs on your cluster will likely be different. + +1. Determine the resource names for the GPU devices: + + ```console + $ kubectl get node cnt-server-2 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))' + ``` + + *Example Output* + + ```output + { + "nvidia.com/GA102GL_A10": "1" + } + ``` + +1. Determine the PCI device IDs for the GPUs. + + * You can search by device name in the [PCI IDs database](https://pci-ids.ucw.cz/v2.2/pci.ids). + + * If you have host access to the node, you can list the NVIDIA GPU devices with a command like the following example: + + ```console + $ lspci -nnk -d 10de: + ``` + + *Example Output* + + ```output + 65:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A10] [10de:2236] (rev a1) + Subsystem: NVIDIA Corporation GA102GL [A10] [10de:1482] + Kernel modules: nvidiafb, nouveau + ``` + +1. Modify the `KubeVirt` custom resource like the following partial example. + + ```yaml + ... + spec: + configuration: + developerConfiguration: + featureGates: + - GPU + - DisableMDEVConfiguration + permittedHostDevices: # Defines VM devices to import. + pciHostDevices: # Include for GPU passthrough + - externalResourceProvider: true + pciVendorSelector: 10DE:2236 + resourceName: nvidia.com/GA102GL_A10 + ... + ``` + + Replace the values in the YAML as follows: + + * `pciVendorSelector` and `resourceName` under `pciHostDevices` to correspond to your GPU model. + + * Set `externalResourceProvider=true` to indicate that this resource is provided by an external device plugin, in this case the `sandbox-device-plugin` that is deployed by the GPU Operator. + +Refer to the [KubeVirt user guide](https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices) for more information on the configuration options. 
+ +## Create a virtual machine with GPU + +After the `sandbox-device-plugin` pod is running on your worker nodes and the GPU resources have been added to the +KubeVirt allowlist, you can assign a GPU to a virtual machine by editing the `spec.domain.devices.gpus` field +in the `VirtualMachineInstance` manifest. + +Example for GPU passthrough: + +```yaml +apiVersion: kubevirt.io/v1alpha3 +kind: VirtualMachineInstance +... +spec: + domain: + devices: + gpus: + - deviceName: nvidia.com/GA102GL_A10 + name: gpu1 +... +``` + +Example for vGPU: + +```yaml +apiVersion: kubevirt.io/v1alpha3 +kind: VirtualMachineInstance +... +spec: + domain: + devices: + gpus: + - deviceName: nvidia.com/NVIDIA_A10-12Q + name: gpu1 +... +``` + +* `deviceName` is the resource name representing the device. + +* `name` is a name to identify the device in the virtual machine. diff --git a/gpu-operator/.agents/skills/gpu-operator-kubevirt/references/vgpu-device-config.md b/gpu-operator/.agents/skills/gpu-operator-kubevirt/references/vgpu-device-config.md new file mode 100644 index 000000000..f1aa86068 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-kubevirt/references/vgpu-device-config.md @@ -0,0 +1,99 @@ + + + +# vGPU Device Configuration + +The vGPU Device Manager assists in creating vGPU devices on GPU worker nodes. +The vGPU Device Manager allows administrators to declaratively define a set of possible vGPU device configurations they would like applied to GPUs on a node. +At runtime, administrators then point the vGPU Device Manager at one of these configurations, and vGPU Device Manager takes care of applying it. + +The configuration file is created as a ConfigMap, and is shared across all worker nodes. +At runtime, a node label, `nvidia.com/vgpu.config`, can be used to decide which of these configurations to actually apply to a node at any given time. +If the node is not labeled, then the `default` configuration will be used.
+For more information on this component and how it is configured, refer to the [NVIDIA vGPU Device Manager README](https://github.com/NVIDIA/vgpu-device-manager). + +By default, the GPU Operator deploys a ConfigMap for the vGPU Device Manager, containing named configurations for all [vGPU types supported by NVIDIA vGPU](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#supported-gpus-grid-vgpu). +Users can select a specific configuration for a worker node by applying the `nvidia.com/vgpu.config` node label. +For example, labeling a node with `nvidia.com/vgpu.config=A10-8Q` would create three vGPU devices of type **A10-8Q** on all **A10** GPUs on the node. Note that three is the maximum number of **A10-8Q** devices that can be created per GPU. +If the node is not labeled, the `default` configuration will be applied. +The `default` configuration will create Q-series vGPU devices on all GPUs, where the amount of framebuffer memory per vGPU device is half the total GPU memory. +For example, the `default` configuration will create two **A10-12Q** devices on all **A10** GPUs. + +You can also create different vGPU Q profiles on the same GPU using vGPU Device Manager configuration. +For example, you can create a **A10-4Q** and a **A10-6Q** device on same GPU by creating a vGPU Device Manager configuration with the following content: + +```yaml +version: v1 +vgpu-configs: + custom-A10-config: + - devices: all + vgpu-devices: + "A10-4Q": 3 + "A10-6Q": 2 +``` + +If custom vGPU device configuration is desired, more than the default config map provides, you can create your own config map: + +```console +$ kubectl create configmap custom-vgpu-config -n gpu-operator --from-file=config.yaml=/path/to/file +``` + +And then configure the GPU Operator to use it by setting `vgpuDeviceManager.config.name=custom-vgpu-config`. 
+ +## Apply a New vGPU Device Configuration + +You can apply a specific vGPU device configuration on a per-node basis by setting the `nvidia.com/vgpu.config` node label. +It is recommended to set this node label prior to installing the GPU Operator if you do not want the default configuration applied. + +Switching vGPU device configuration after one has been successfully applied assumes that no virtual machines with vGPU are currently running on the node. +Any existing virtual machines should be shut down or migrated before you apply the new configuration. + +To apply a new configuration after GPU Operator install, update the `nvidia.com/vgpu.config` node label. + +> [!NOTE] +> On GPUs that support MIG, you have the option to select MIG-backed vGPU instances instead of time-sliced vGPU instances. +> To select a MIG-backed vGPU profile, label the node with the name of the MIG-backed vGPU profile. +> The following example shows how to apply a new configuration on a system with two **A10** GPUs. + +```console +$ nvidia-smi -L +GPU 0: NVIDIA A10 (UUID: GPU-ebd34bdf-1083-eaac-2aff-4b71a022f9bd) +GPU 1: NVIDIA A10 (UUID: GPU-1795e88b-3395-b27b-dad8-0488474eec0c) +``` + +In this example, the GPU Operator has been installed and the `nvidia.com/vgpu.config` label was not added to worker nodes, meaning the `default` vGPU config was applied. +This resulted in the creation of four **A10-12Q** devices (two per GPU): + +```console +$ kubectl get node cnt-server-2 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))' +{ + "nvidia.com/NVIDIA_A10-12Q": "4" +} +``` + +To create **A10-4Q** devices instead, add the `nvidia.com/vgpu.config` label to the node: + +```console +$ kubectl label node --overwrite nvidia.com/vgpu.config=A10-4Q +``` + +After the vGPU Device Manager finishes applying the new configuration, all GPU Operator pods should return to the Running state.
+ +```console +$ kubectl get pods -n gpu-operator +NAME READY STATUS RESTARTS AGE +... +nvidia-sandbox-device-plugin-daemonset-brtb6 1/1 Running 0 10s +nvidia-sandbox-validator-ljnwg 1/1 Running 0 10s +nvidia-vgpu-device-manager-8mgg8 1/1 Running 0 30m +nvidia-vgpu-manager-daemonset-fpplc 1/1 Running 0 31m +``` + +You can now see 12 **A10-4Q** devices on the node, as six **A10-4Q** devices can be created per **A10** GPU. + +```console +$ kubectl get node cnt-server-2 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))' +{ + "nvidia.com/NVIDIA_A10-4Q": "12" +} +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md index 0ebff5f33..1c44d65c6 100644 --- a/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-multiinstance/SKILL.md @@ -20,541 +20,43 @@ tags: # GPU Operator with MIG +Partition MIG-capable NVIDIA GPUs into separate, secure GPU instances using the +GPU Operator's MIG Manager: choose a MIG strategy, enable MIG at install, label +nodes with MIG profiles, and reconfigure or disable MIG dynamically. + ## Prerequisites - A running Kubernetes cluster with NVIDIA GPU worker nodes. - The NVIDIA GPU Operator installed (use the `gpu-operator-install` skill). - One or more MIG-capable NVIDIA GPUs (such as A100, A30, H100, or H200). The MIG Manager runs by default only on nodes with GPUs that support MIG. -## About Multi-Instance GPU - -Multi-Instance GPU (MIG) enables GPUs based on the NVIDIA Ampere and later architectures, such as NVIDIA A100, to be partitioned into separate and secure GPU instances for CUDA applications. -Refer to the [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html) for more information about MIG. 
- -GPU Operator deploys MIG Manager to manage MIG configuration on nodes in your Kubernetes cluster. -You must enable MIG during installation by choosing a MIG strategy before you can configure MIG. - -Refer to the [Multi-Instance GPU architecture](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html) for more information about how MIG is implemented in the GPU Operator. - -## Enabling MIG During Installation - -Use the following steps to enable MIG and deploy MIG Manager. - -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). - -1. Install the Operator: - - ```console - $ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set mig.strategy=single - ``` - - This example sets `single` as the MIG strategy. - Available MIG strategy options: - - * `single`: MIG mode is enabled on all GPUs on a node. - * `mixed`: MIG mode is not enabled on all GPUs on a node. - - In a cloud service provider (CSP) environment such as Google Cloud, also specify - `--set migManager.env[0].name=WITH_REBOOT --set-string migManager.env[0].value=true` - to ensure that the node reboots and can apply the MIG configuration. - - MIG Manager supports preinstalled drivers, meaning drivers that are not managed by the GPU Operator and you installed directly on the host. - If drivers are preinstalled, also specify `--set driver.enabled=false`. - Refer to [MIG with pre-installed drivers](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html) for more details. - - After several minutes, all GPU Operator pods, including the `nvidia-mig-manager` are deployed on nodes that have MIG capable GPUs. - - > [!NOTE] - > MIG Manager requires that no user workloads are running on the GPUs being configured. 
- > In some cases, the node might need to be rebooted, such as a CSP, so the node might need to be cordoned - > before changing the MIG mode or the MIG geometry on the GPUs. - - 1. Optional: Display the pods in the Operator namespace: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - *Example Output* - -1. Optional: Display the labels applied to the node: - - ```console - $ kubectl get node -o json | jq '.items[].metadata.labels' - ``` - - *Partial Output* - -## Configuring MIG Profiles - -When MIG is enabled, nodes are labeled with `nvidia.com/mig.config: all-disabled` by default. -To use a profile on a node, update the label value with the desired profile, for example, `nvidia.com/mig.config=all-1g.10gb`. - -Introduced in GPU Operator v26.3.0, MIG Manager generates the MIG configuration for a node at runtime from the available hardware. -The configuration is generated on startup, discovering MIG profiles for each MIG-capable GPU on a node using [NVIDIA Management Library (NVML)](https://developer.nvidia.com/management-library-nvml), then writing it to a ConfigMap for each MIG-capable node in your cluster. -The ConfigMap is named `-mig-config`, where `` is the name of each MIG-capable node. -Each ConfigMap contains a complete mig-parted config, including `all-disabled`, `all-enabled`, per-profile configs such as `all-1g.10gb`, and `all-balanced` with device-filter support for mixed GPU types. -When a new MIG-capable GPU is added to a node, the new GPU is automatically added to the ConfigMap. - -If you need custom profiles, you can use a custom MIG configuration instead of the generated one. -You can use the Helm chart to create a ConfigMap from values at install time, or create and reference your own ConfigMap. -For an example, refer to dynamically-creating-the-mig-configuration-configmap. 
- -> [!NOTE] -> Generated MIG configuration might not be available on older drivers, such as 535 branch GPU drivers, as they do not support querying MIG profiles when MIG mode is disabled. In those cases, the GPU Operator will use a [static Configmap](https://github.com/NVIDIA/gpu-operator/blob/main/assets/state-mig-manager/0400_configmap.yaml), `default-mig-parted-config`, for MIG profiles. -### Example: Single MIG Strategy - -The following steps show how to use the single MIG strategy and configure the `1g.10gb` profile on one node. - -1. Configure the MIG strategy to `single` if you are unsure of the current strategy: - - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ - --type='json' \ - -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"single"}]' - ``` - -1. Label the nodes with the profile to configure: - - ```console - $ kubectl label nodes nvidia.com/mig.config=all-1g.10gb --overwrite - ``` - - MIG Manager proceeds to apply a `mig.config.state` label to the node and terminates all - the GPU pods in preparation to enable MIG mode and configure the GPU into the desired MIG geometry. - -1. Optional: Display the node labels: - - ```console - $ kubectl get node -o=jsonpath='{.metadata.labels}' | jq . - ``` - - *Partial Output* - - ```json - "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3", - "nvidia.com/gpu.replicas": "1", - "nvidia.com/gpu.sharing-strategy": "none", - "nvidia.com/mig.capable": "true", - "nvidia.com/mig.config": "all-1g.10gb", - "nvidia.com/mig.config.state": "pending", - "nvidia.com/mig.strategy": "single" - } - ``` - - When the `WITH_REBOOT` option is set, MIG Manager sets the label to `nvidia.com/mig.config.state: rebooting`. - -1. Confirm that MIG Manager completed the configuration by checking the node labels: - - ```console - $ kubectl get node -o=jsonpath='{.metadata.labels}' | jq . 
- ``` - - Check for the following labels: - - * `nvidia.com/gpu.count: 7` (the value differs according to the GPU model) - * `nvidia.com/gpu.slices.ci: 1` - * `nvidia.com/gpu.slices.gi: 1` - * `nvidia.com/mig.config.state: success` - - *Partial Output* - - ```json - "nvidia.com/gpu.count": "7", - "nvidia.com/gpu.present": "true", - "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3-MIG-1g.10gb", - "nvidia.com/gpu.slices.ci": "1", - "nvidia.com/gpu.slices.gi": "1", - "nvidia.com/mig.capable": "true", - "nvidia.com/mig.config": "all-1g.10gb", - "nvidia.com/mig.config.state": "success", - "nvidia.com/mig.strategy": "single" - ``` - -1. Optional: Run the `nvidia-smi` command in the driver container to verify that the MIG configuration has been applied. - - ```console - $ kubectl exec -it -n gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi -L - ``` - - *Example Output* - -### Example: Mixed MIG Strategy - -The following steps show how to use the `mixed` MIG strategy and configure the `all-balanced` profile on one node. - -1. Configure the MIG strategy to `mixed` if you are unsure of the current strategy: - - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ - --type='json' \ - -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"mixed"}]' - ``` - -1. Label the nodes with the profile to configure: - - ```console - $ kubectl label nodes nvidia.com/mig.config=all-balanced --overwrite - ``` - - MIG Manager proceeds to apply a `mig.config.state` label to the node and terminates all - the GPU pods in preparation to enable MIG mode and configure the GPU into the desired MIG geometry. - -1. Confirm that MIG Manager completed the configuration by checking the node labels: - - ```console - $ kubectl get node -o=jsonpath='{.metadata.labels}' | jq . - ``` - - Check for labels like the following. - The profiles and GPU counts differ according to the GPU model. 
- - * `nvidia.com/mig-1g.10gb.count: 2` - * `nvidia.com/mig-2g.20gb.count: 1` - * `nvidia.com/mig-3g.40gb.count: 1` - * `nvidia.com/mig.config.state: success` - - *Partial Output* - -1. Optional: Run the `nvidia-smi` command in the driver container to verify that the GPU has been configured. - - ```console - $ kubectl exec -it -n gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi -L - ``` - - *Example Output* - -### Example: Reconfiguring MIG Profiles - -MIG Manager supports dynamic reconfiguration of the MIG geometry. -The following steps show how to update a GPU on a node to the `3g.40gb` profile with the single MIG strategy. - -1. Label the node with the profile: - - ```console - $ kubectl label nodes nvidia.com/mig.config=all-3g.40gb --overwrite - ``` - -1. Optional: Monitor the MIG Manager logs to confirm the new MIG geometry is applied: - - ```console - $ kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager - ``` - - *Example Output* - - ```console - Applying the selected MIG config to the node - time="2024-05-14T18:31:26Z" level=debug msg="Parsing config file..." - time="2024-05-14T18:31:26Z" level=debug msg="Selecting specific MIG config..." - time="2024-05-14T18:31:26Z" level=debug msg="Running apply-start hook" - time="2024-05-14T18:31:26Z" level=debug msg="Checking current MIG mode..." - time="2024-05-14T18:31:26Z" level=debug msg="Walking MigConfig for (devices=all)" - time="2024-05-14T18:31:26Z" level=debug msg=" GPU 0: 0x233010DE" - time="2024-05-14T18:31:26Z" level=debug msg=" Asserting MIG mode: Enabled" - time="2024-05-14T18:31:26Z" level=debug msg=" MIG capable: true\n" - time="2024-05-14T18:31:26Z" level=debug msg=" Current MIG mode: Enabled" - time="2024-05-14T18:31:26Z" level=debug msg="Checking current MIG device configuration..." 
- time="2024-05-14T18:31:26Z" level=debug msg="Walking MigConfig for (devices=all)" - time="2024-05-14T18:31:26Z" level=debug msg=" GPU 0: 0x233010DE" - time="2024-05-14T18:31:26Z" level=debug msg=" Asserting MIG config: map[3g.40gb:2]" - time="2024-05-14T18:31:26Z" level=debug msg="Running pre-apply-config hook" - time="2024-05-14T18:31:26Z" level=debug msg="Applying MIG device configuration..." - time="2024-05-14T18:31:26Z" level=debug msg="Walking MigConfig for (devices=all)" - time="2024-05-14T18:31:26Z" level=debug msg=" GPU 0: 0x233010DE" - time="2024-05-14T18:31:26Z" level=debug msg=" MIG capable: true\n" - time="2024-05-14T18:31:26Z" level=debug msg=" Updating MIG config: map[3g.40gb:2]" - MIG configuration applied successfully - time="2024-05-14T18:31:27Z" level=debug msg="Running apply-exit hook" - Restarting validator pod to re-run all validations - pod "nvidia-operator-validator-kmncw" deleted - Restarting all GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels - node/node-name labeled - Changing the 'nvidia.com/mig.config.state' node label to 'success' - ``` - -1. Optional: Display the node labels to confirm the GPU count (`2`), slices (`3`), and profile are set: - - ```console - $ kubectl get node -o=jsonpath='{.metadata.labels}' | jq . 
- ``` - - *Partial Output* - - ```json - "nvidia.com/gpu.count": "2", - "nvidia.com/gpu.present": "true", - "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3-MIG-3g.40gb", - "nvidia.com/gpu.replicas": "1", - "nvidia.com/gpu.sharing-strategy": "none", - "nvidia.com/gpu.slices.ci": "3", - "nvidia.com/gpu.slices.gi": "3", - "nvidia.com/mig.capable": "true", - "nvidia.com/mig.config": "all-3g.40gb", - "nvidia.com/mig.config.state": "success", - "nvidia.com/mig.strategy": "single", - "nvidia.com/mps.capable": "false" - } - ``` - -### Example: Custom MIG Configuration During Installation - -If you need to use custom profiles, you can create a custom ConfigMap during installation by passing in a name and data for the ConfigMap with the Helm command. - -The MIG Manager daemonset is configured to use this ConfigMap instead of the auto-generated one. - -In your values.yaml file, set `migManager.config.create` to `true`, set `migManager.config.name`, and add the ConfigMap data under `migManager.config.data`, for example: - -1. In your `values.yaml` file, add the data for the ConfigMap, like the following example: - -> [!NOTE] -> Custom ConfigMaps must contain a key named "config.yaml" - -1. Install or upgrade the GPU Operator with this values file so the chart creates the ConfigMap: - - ```console - $ helm upgrade --install gpu-operator -n gpu-operator --create-namespace \ - nvidia/gpu-operator --version= \ - -f values.yaml - ``` - -1. If the custom configuration specifies more than one instance profile, set the strategy to `mixed`: - - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ - --type='json' \ - -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"mixed"}]' - ``` - -1. Label the nodes with the profile to configure: - - ```console - $ kubectl label nodes nvidia.com/mig.config=custom-mig --overwrite - ``` - -1. 
Optional: Monitor the MIG Manager logs to confirm the new MIG geometry is applied: - - ```console - $ kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager - ``` - - *Example Output* - - ```console - Applying the selected MIG config to the node - time="2024-05-15T13:40:08Z" level=debug msg="Parsing config file..." - time="2024-05-15T13:40:08Z" level=debug msg="Selecting specific MIG config..." - time="2024-05-15T13:40:08Z" level=debug msg="Running apply-start hook" - time="2024-05-15T13:40:08Z" level=debug msg="Checking current MIG mode..." - time="2024-05-15T13:40:08Z" level=debug msg="Walking MigConfig for (devices=all)" - time="2024-05-15T13:40:08Z" level=debug msg=" GPU 0: 0x233010DE" - time="2024-05-15T13:40:08Z" level=debug msg=" Asserting MIG mode: Enabled" - time="2024-05-15T13:40:08Z" level=debug msg=" MIG capable: true\n" - time="2024-05-15T13:40:08Z" level=debug msg=" Current MIG mode: Enabled" - time="2024-05-15T13:40:08Z" level=debug msg="Checking current MIG device configuration..." - time="2024-05-15T13:40:08Z" level=debug msg="Walking MigConfig for (devices=all)" - time="2024-05-15T13:40:08Z" level=debug msg=" GPU 0: 0x233010DE" - time="2024-05-15T13:40:08Z" level=debug msg=" Asserting MIG config: map[1g.10gb:5 2g.20gb:1]" - time="2024-05-15T13:40:08Z" level=debug msg="Running pre-apply-config hook" - time="2024-05-15T13:40:08Z" level=debug msg="Applying MIG device configuration..." 
- time="2024-05-15T13:40:08Z" level=debug msg="Walking MigConfig for (devices=all)" - time="2024-05-15T13:40:08Z" level=debug msg=" GPU 0: 0x233010DE" - time="2024-05-15T13:40:08Z" level=debug msg=" MIG capable: true\n" - time="2024-05-15T13:40:08Z" level=debug msg=" Updating MIG config: map[1g.10gb:5 2g.20gb:1]" - time="2024-05-15T13:40:09Z" level=debug msg="Running apply-exit hook" - MIG configuration applied successfully - ``` - -### Example: Custom MIG Configuration - -You can create and apply a ConfigMap yourself if the default profiles do not meet your needs. - -1. Create a file, such as `custom-mig-config.yaml`, with contents like the following example: - - ```yaml - apiVersion: v1 - kind: ConfigMap - metadata: - name: custom-mig-config - data: - config.yaml: | - version: v1 - mig-configs: - all-disabled: - - devices: all - mig-enabled: false - - five-1g-one-2g: - - devices: all - mig-enabled: true - mig-devices: - "1g.10gb": 5 - "2g.20gb": 1 - ``` - -> [!NOTE] -> Custom ConfigMaps must contain a key named "config.yaml" - -1. Apply the manifest: - - ```console - $ kubectl apply -n gpu-operator -f custom-mig-config.yaml - ``` - -1. If the custom configuration specifies more than one instance profile, set the strategy to `mixed`: - - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ - --type='json' \ - -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"mixed"}]' - ``` - -1. Patch the cluster policy so MIG Manager uses the custom ConfigMap: - - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ - --type='json' \ - -p='[{"op":"replace", "path":"/spec/migManager/config/name", "value":"custom-mig-config"}]' - ``` - -1. 
Label the nodes with the profile to configure: - - ```console - $ kubectl label nodes nvidia.com/mig.config=five-1g-one-2g --overwrite - ``` - -## Verification: Running Sample CUDA Workloads - -## Disabling MIG - -You can disable MIG on a node by setting the `nvidia.com/mig.config` label to `all-disabled`: - -```console -$ kubectl label nodes nvidia.com/mig.config=all-disabled --overwrite -``` - -## MIG Manager with Preinstalled Drivers - -MIG Manager supports preinstalled drivers. -Information in the preceding sections still applies, however there are a few additional details to consider. - -### Install - -During GPU Operator installation, `driver.enabled=false` must be set. The following options -can be used to install the GPU Operator: - -```console -$ helm install gpu-operator \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set driver.enabled=false -``` - -### Managing Host GPU Clients - -MIG Manager stops all operator-managed pods that have access to GPUs when applying a MIG reconfiguration. -When drivers are preinstalled, there can be GPU clients on the host that also need to be stopped. - -When drivers are preinstalled, MIG Manager attempts to stop and restart a list of systemd services on the host across a MIG reconfiguration. -The list of services is specified in the `default-gpu-clients` ConfigMap. - -The following sample GPU clients file, `clients.yaml`, is used to create the `default-gpu-clients` ConfigMap: - -```yaml -version: v1 -systemd-services: - - nvsm.service - - nvsm-mqtt.service - - nvsm-core.service - - nvsm-api-gateway.service - - nvsm-notifier.service - - nv_peer_mem.service - - nvidia-dcgm.service - - dcgm.service - - dcgm-exporter.service -``` - -You can modify the list by editing the ConfigMap after installation. -Alternatively, you can create a custom ConfigMap for use by MIG Manager by performing the following steps: - -1. 
Create the `gpu-operator` namespace: - - ```console - $ kubectl create namespace gpu-operator - ``` - -1. Create a `ConfigMap` containing the custom `clients.yaml` file with a list of GPU clients: - - ```console - $ kubectl create configmap -n gpu-operator gpu-clients --from-file=clients.yaml - ``` - -1. Install the GPU Operator: - - ```console - $ helm install gpu-operator \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set migManager.gpuClientsConfig.name=gpu-clients \ - --set driver.enabled=false - ``` - -## Architecture - -MIG Manager is designed as a controller within Kubernetes. It watches for changes to the -`nvidia.com/mig.config` label on the node and then applies the user-requested MIG configuration. -When the label changes, MIG Manager first stops all GPU pods, including device plugin, GPU feature discovery, -and DCGM exporter. -MIG Manager then stops all host GPU clients listed in the `clients.yaml` ConfigMap if drivers are preinstalled. -Finally, it applies the MIG reconfiguration and restarts the GPU pods and possibly, host GPU clients. -The MIG reconfiguration can also involve rebooting a node if a reboot is required to enable MIG mode. - -The default MIG profiles are specified in the `-mig-config` ConfigMap. -This ConfigMap is auto-generated by the MIG Manager for each MIG-capable node and contains the standard MIG profiles for the available GPUs on the node. -You can also configure the operator to configure a custom ConfigMap to use instead of the auto-generated one. - -You can specify one of these profiles to apply to the `mig.config` label to trigger a reconfiguration of the MIG geometry. +## Activation -MIG Manager uses the [mig-parted](https://github.com/NVIDIA/mig-parted) tool to apply the configuration -changes to the GPU, including enabling MIG mode, with a node reboot as required by some scenarios. 
+Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All Helm/kubectl command sequences, MIG profile labels, ConfigMap +manifests, and expected node-label output live only in those reference files — +do not improvise commands from this dispatch layer. -```mermaid -flowchart +## Phases -subgraph mig[MIG Manager] - direction TB - A[Controller] <--> B[MIG-Parted] -end +| Phase | Summary | Reference | +|-------|---------|-----------| +| Concepts & install | What MIG is, the `single` vs `mixed` strategies, enabling MIG and deploying MIG Manager at install time, how MIG profiles/ConfigMaps are generated, and the MIG Manager architecture. | [references/concepts-and-install.md](references/concepts-and-install.md) | +| Examples | Worked examples: single strategy, mixed strategy, dynamic reconfiguration, custom ConfigMap at install, custom ConfigMap applied manually, verification, and disabling MIG. | [references/examples.md](references/examples.md) | +| Preinstalled drivers | Using MIG Manager when GPU drivers are preinstalled on the host (`driver.enabled=false`) and managing host GPU clients via the `clients.yaml` ConfigMap. | [references/preinstalled-drivers.md](references/preinstalled-drivers.md) | -A -- on change --> C +## Hard rules (apply across all phases) -subgraph recon[Reconfiguration] - C["Config is Pending - or Rebooting"] - --> - D["Stop Operator Pods"] - --> - E["Enable MIG Mode and - Reboot if Required"] - --> - F["Use mig-parted to - Configure MIG Geometry"] - --> - G["Restart Operator Pods"] -end +- You must enable MIG and choose a strategy (`single` or `mixed`) at install before you can configure MIG profiles. +- MIG Manager requires that no user workloads run on the GPUs being configured; cordon nodes that may reboot (for example CSP environments with `WITH_REBOOT`). 
+- Use `mixed` strategy whenever a node's configuration specifies more than one instance profile. +- Custom MIG ConfigMaps must contain a key named `config.yaml`. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. -H["Set mig.config label - to Success"] -I["Set mig.config label - to Failed"] +## Verification -G --> H -G -- on failure --> I -``` +After labeling a node with a MIG profile, confirm `nvidia.com/mig.config.state: +success` on the node and run a sample CUDA workload that requests a MIG resource. +Exact commands and expected label output are in +[references/examples.md](references/examples.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-multiinstance/references/concepts-and-install.md b/gpu-operator/.agents/skills/gpu-operator-multiinstance/references/concepts-and-install.md new file mode 100644 index 000000000..b5bd1ce6b --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-multiinstance/references/concepts-and-install.md @@ -0,0 +1,139 @@ + + + +# Multi-Instance GPU: Concepts, Enabling MIG, and Configuring Profiles + +## About Multi-Instance GPU + +Multi-Instance GPU (MIG) enables GPUs based on the NVIDIA Ampere and later architectures, such as NVIDIA A100, to be partitioned into separate and secure GPU instances for CUDA applications. +Refer to the [MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html) for more information about MIG. + +GPU Operator deploys MIG Manager to manage MIG configuration on nodes in your Kubernetes cluster. +You must enable MIG during installation by choosing a MIG strategy before you can configure MIG. + +Refer to the [Multi-Instance GPU architecture](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html) for more information about how MIG is implemented in the GPU Operator. 
+ +## Enabling MIG During Installation + +Use the following steps to enable MIG and deploy MIG Manager. + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +1. Install the Operator: + + ```console + $ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set mig.strategy=single + ``` + + This example sets `single` as the MIG strategy. + Available MIG strategy options: + + * `single`: MIG mode is enabled on all GPUs on a node. + * `mixed`: MIG mode is not enabled on all GPUs on a node. + + In a cloud service provider (CSP) environment such as Google Cloud, also specify + `--set migManager.env[0].name=WITH_REBOOT --set-string migManager.env[0].value=true` + to ensure that the node reboots and can apply the MIG configuration. + + MIG Manager supports preinstalled drivers, meaning drivers that are not managed by the GPU Operator and that you installed directly on the host. + If drivers are preinstalled, also specify `--set driver.enabled=false`. + Refer to [MIG with pre-installed drivers](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html) for more details. + + After several minutes, all GPU Operator pods, including `nvidia-mig-manager`, are deployed on nodes that have MIG-capable GPUs. + + > [!NOTE] + > MIG Manager requires that no user workloads are running on the GPUs being configured. + > In some cases, such as in a CSP environment, the node might need to be rebooted, so the node might need to be cordoned + > before changing the MIG mode or the MIG geometry on the GPUs. + + 1. Optional: Display the pods in the Operator namespace: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + *Example Output* + +1.
Optional: Display the labels applied to the node: + + ```console + $ kubectl get node -o json | jq '.items[].metadata.labels' + ``` + + *Partial Output* + +## Configuring MIG Profiles + +When MIG is enabled, nodes are labeled with `nvidia.com/mig.config: all-disabled` by default. +To use a profile on a node, update the label value with the desired profile, for example, `nvidia.com/mig.config=all-1g.10gb`. + +Starting with GPU Operator v26.3.0, MIG Manager generates the MIG configuration for a node at runtime from the available hardware. +On startup, MIG Manager discovers the MIG profiles for each MIG-capable GPU on a node using [NVIDIA Management Library (NVML)](https://developer.nvidia.com/management-library-nvml), then writes the generated configuration to a ConfigMap for each MIG-capable node in your cluster. +The ConfigMap is named `-mig-config`, where `` is the name of each MIG-capable node. +Each ConfigMap contains a complete mig-parted config, including `all-disabled`, `all-enabled`, per-profile configs such as `all-1g.10gb`, and `all-balanced` with device-filter support for mixed GPU types. +When a new MIG-capable GPU is added to a node, the new GPU is automatically added to the ConfigMap. + +If you need custom profiles, you can use a custom MIG configuration instead of the generated one. +You can use the Helm chart to create a ConfigMap from values at install time, or create and reference your own ConfigMap. +For examples, refer to the custom MIG configuration examples in `references/examples.md`. + +> [!NOTE] +> Generated MIG configuration might not be available on older drivers, such as 535 branch GPU drivers, because they do not support querying MIG profiles when MIG mode is disabled. In those cases, the GPU Operator uses a [static ConfigMap](https://github.com/NVIDIA/gpu-operator/blob/main/assets/state-mig-manager/0400_configmap.yaml), `default-mig-parted-config`, for MIG profiles. + +## Architecture + +MIG Manager is designed as a controller within Kubernetes.
It watches for changes to the +`nvidia.com/mig.config` label on the node and then applies the user-requested MIG configuration. +When the label changes, MIG Manager first stops all GPU pods, including the device plugin, GPU feature discovery, +and DCGM exporter. +MIG Manager then stops all host GPU clients listed in the GPU clients ConfigMap (`default-gpu-clients` by default) if drivers are preinstalled. +Finally, it applies the MIG reconfiguration and restarts the GPU pods and, if applicable, the host GPU clients. +The MIG reconfiguration can also involve rebooting a node if a reboot is required to enable MIG mode. + +The default MIG profiles are specified in the `-mig-config` ConfigMap. +This ConfigMap is auto-generated by MIG Manager for each MIG-capable node and contains the standard MIG profiles for the available GPUs on the node. +You can also configure the Operator to use a custom ConfigMap instead of the auto-generated one. + +To trigger a reconfiguration of the MIG geometry, set the `nvidia.com/mig.config` label to one of these profiles. + +MIG Manager uses the [mig-parted](https://github.com/NVIDIA/mig-parted) tool to apply the configuration +changes to the GPU, including enabling MIG mode, with a node reboot as required by some scenarios.
+ +```mermaid +flowchart + +subgraph mig[MIG Manager] + direction TB + A[Controller] <--> B[MIG-Parted] +end + +A -- on change --> C + +subgraph recon[Reconfiguration] + C["Config is Pending + or Rebooting"] + --> + D["Stop Operator Pods"] + --> + E["Enable MIG Mode and + Reboot if Required"] + --> + F["Use mig-parted to + Configure MIG Geometry"] + --> + G["Restart Operator Pods"] +end + +H["Set mig.config label + to Success"] +I["Set mig.config label + to Failed"] + +G --> H +G -- on failure --> I +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-multiinstance/references/examples.md b/gpu-operator/.agents/skills/gpu-operator-multiinstance/references/examples.md new file mode 100644 index 000000000..93e3aaf71 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-multiinstance/references/examples.md @@ -0,0 +1,342 @@ + + + +# MIG Configuration Examples + +Throughout, replace `` with your target GPU Operator release. + +## Example: Single MIG Strategy + +The following steps show how to use the single MIG strategy and configure the `1g.10gb` profile on one node. + +1. Configure the MIG strategy to `single` if you are unsure of the current strategy: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ + --type='json' \ + -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"single"}]' + ``` + +1. Label the nodes with the profile to configure: + + ```console + $ kubectl label nodes nvidia.com/mig.config=all-1g.10gb --overwrite + ``` + + MIG Manager proceeds to apply a `mig.config.state` label to the node and terminates all + the GPU pods in preparation to enable MIG mode and configure the GPU into the desired MIG geometry. + +1. Optional: Display the node labels: + + ```console + $ kubectl get node -o=jsonpath='{.metadata.labels}' | jq . 
+ ``` + + *Partial Output* + + ```json + "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3", + "nvidia.com/gpu.replicas": "1", + "nvidia.com/gpu.sharing-strategy": "none", + "nvidia.com/mig.capable": "true", + "nvidia.com/mig.config": "all-1g.10gb", + "nvidia.com/mig.config.state": "pending", + "nvidia.com/mig.strategy": "single" + } + ``` + + When the `WITH_REBOOT` option is set, MIG Manager sets the label to `nvidia.com/mig.config.state: rebooting`. + +1. Confirm that MIG Manager completed the configuration by checking the node labels: + + ```console + $ kubectl get node -o=jsonpath='{.metadata.labels}' | jq . + ``` + + Check for the following labels: + + * `nvidia.com/gpu.count: 7` (the value differs according to the GPU model) + * `nvidia.com/gpu.slices.ci: 1` + * `nvidia.com/gpu.slices.gi: 1` + * `nvidia.com/mig.config.state: success` + + *Partial Output* + + ```json + "nvidia.com/gpu.count": "7", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3-MIG-1g.10gb", + "nvidia.com/gpu.slices.ci": "1", + "nvidia.com/gpu.slices.gi": "1", + "nvidia.com/mig.capable": "true", + "nvidia.com/mig.config": "all-1g.10gb", + "nvidia.com/mig.config.state": "success", + "nvidia.com/mig.strategy": "single" + ``` + +1. Optional: Run the `nvidia-smi` command in the driver container to verify that the MIG configuration has been applied. + + ```console + $ kubectl exec -it -n gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi -L + ``` + + *Example Output* + +## Example: Mixed MIG Strategy + +The following steps show how to use the `mixed` MIG strategy and configure the `all-balanced` profile on one node. + +1. Configure the MIG strategy to `mixed` if you are unsure of the current strategy: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ + --type='json' \ + -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"mixed"}]' + ``` + +1. 
Label the nodes with the profile to configure: + + ```console + $ kubectl label nodes nvidia.com/mig.config=all-balanced --overwrite + ``` + + MIG Manager proceeds to apply a `mig.config.state` label to the node and terminates all + the GPU pods in preparation to enable MIG mode and configure the GPU into the desired MIG geometry. + +1. Confirm that MIG Manager completed the configuration by checking the node labels: + + ```console + $ kubectl get node -o=jsonpath='{.metadata.labels}' | jq . + ``` + + Check for labels like the following. + The profiles and GPU counts differ according to the GPU model. + + * `nvidia.com/mig-1g.10gb.count: 2` + * `nvidia.com/mig-2g.20gb.count: 1` + * `nvidia.com/mig-3g.40gb.count: 1` + * `nvidia.com/mig.config.state: success` + + *Partial Output* + +1. Optional: Run the `nvidia-smi` command in the driver container to verify that the GPU has been configured. + + ```console + $ kubectl exec -it -n gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi -L + ``` + + *Example Output* + +## Example: Reconfiguring MIG Profiles + +MIG Manager supports dynamic reconfiguration of the MIG geometry. +The following steps show how to update a GPU on a node to the `3g.40gb` profile with the single MIG strategy. + +1. Label the node with the profile: + + ```console + $ kubectl label nodes nvidia.com/mig.config=all-3g.40gb --overwrite + ``` + +1. Optional: Monitor the MIG Manager logs to confirm the new MIG geometry is applied: + + ```console + $ kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager + ``` + + *Example Output* + + ```console + Applying the selected MIG config to the node + time="2024-05-14T18:31:26Z" level=debug msg="Parsing config file..." + time="2024-05-14T18:31:26Z" level=debug msg="Selecting specific MIG config..." + time="2024-05-14T18:31:26Z" level=debug msg="Running apply-start hook" + time="2024-05-14T18:31:26Z" level=debug msg="Checking current MIG mode..." 
+ time="2024-05-14T18:31:26Z" level=debug msg="Walking MigConfig for (devices=all)" + time="2024-05-14T18:31:26Z" level=debug msg=" GPU 0: 0x233010DE" + time="2024-05-14T18:31:26Z" level=debug msg=" Asserting MIG mode: Enabled" + time="2024-05-14T18:31:26Z" level=debug msg=" MIG capable: true\n" + time="2024-05-14T18:31:26Z" level=debug msg=" Current MIG mode: Enabled" + time="2024-05-14T18:31:26Z" level=debug msg="Checking current MIG device configuration..." + time="2024-05-14T18:31:26Z" level=debug msg="Walking MigConfig for (devices=all)" + time="2024-05-14T18:31:26Z" level=debug msg=" GPU 0: 0x233010DE" + time="2024-05-14T18:31:26Z" level=debug msg=" Asserting MIG config: map[3g.40gb:2]" + time="2024-05-14T18:31:26Z" level=debug msg="Running pre-apply-config hook" + time="2024-05-14T18:31:26Z" level=debug msg="Applying MIG device configuration..." + time="2024-05-14T18:31:26Z" level=debug msg="Walking MigConfig for (devices=all)" + time="2024-05-14T18:31:26Z" level=debug msg=" GPU 0: 0x233010DE" + time="2024-05-14T18:31:26Z" level=debug msg=" MIG capable: true\n" + time="2024-05-14T18:31:26Z" level=debug msg=" Updating MIG config: map[3g.40gb:2]" + MIG configuration applied successfully + time="2024-05-14T18:31:27Z" level=debug msg="Running apply-exit hook" + Restarting validator pod to re-run all validations + pod "nvidia-operator-validator-kmncw" deleted + Restarting all GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels + node/node-name labeled + Changing the 'nvidia.com/mig.config.state' node label to 'success' + ``` + +1. Optional: Display the node labels to confirm the GPU count (`2`), slices (`3`), and profile are set: + + ```console + $ kubectl get node -o=jsonpath='{.metadata.labels}' | jq . 
+ ``` + + *Partial Output* + + ```json + "nvidia.com/gpu.count": "2", + "nvidia.com/gpu.present": "true", + "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3-MIG-3g.40gb", + "nvidia.com/gpu.replicas": "1", + "nvidia.com/gpu.sharing-strategy": "none", + "nvidia.com/gpu.slices.ci": "3", + "nvidia.com/gpu.slices.gi": "3", + "nvidia.com/mig.capable": "true", + "nvidia.com/mig.config": "all-3g.40gb", + "nvidia.com/mig.config.state": "success", + "nvidia.com/mig.strategy": "single", + "nvidia.com/mps.capable": "false" + } + ``` + +## Example: Custom MIG Configuration During Installation + +If you need to use custom profiles, you can create a custom ConfigMap during installation by passing in a name and data for the ConfigMap with the Helm command. + +The MIG Manager daemonset is configured to use this ConfigMap instead of the auto-generated one. + +In your values.yaml file, set `migManager.config.create` to `true`, set `migManager.config.name`, and add the ConfigMap data under `migManager.config.data`, for example: + +1. In your `values.yaml` file, add the data for the ConfigMap, like the following example: + +> [!NOTE] +> Custom ConfigMaps must contain a key named "config.yaml" + +1. Install or upgrade the GPU Operator with this values file so the chart creates the ConfigMap: + + ```console + $ helm upgrade --install gpu-operator -n gpu-operator --create-namespace \ + nvidia/gpu-operator --version= \ + -f values.yaml + ``` + +1. If the custom configuration specifies more than one instance profile, set the strategy to `mixed`: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ + --type='json' \ + -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"mixed"}]' + ``` + +1. Label the nodes with the profile to configure: + + ```console + $ kubectl label nodes nvidia.com/mig.config=custom-mig --overwrite + ``` + +1. 
Optional: Monitor the MIG Manager logs to confirm the new MIG geometry is applied: + + ```console + $ kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager + ``` + + *Example Output* + + ```console + Applying the selected MIG config to the node + time="2024-05-15T13:40:08Z" level=debug msg="Parsing config file..." + time="2024-05-15T13:40:08Z" level=debug msg="Selecting specific MIG config..." + time="2024-05-15T13:40:08Z" level=debug msg="Running apply-start hook" + time="2024-05-15T13:40:08Z" level=debug msg="Checking current MIG mode..." + time="2024-05-15T13:40:08Z" level=debug msg="Walking MigConfig for (devices=all)" + time="2024-05-15T13:40:08Z" level=debug msg=" GPU 0: 0x233010DE" + time="2024-05-15T13:40:08Z" level=debug msg=" Asserting MIG mode: Enabled" + time="2024-05-15T13:40:08Z" level=debug msg=" MIG capable: true\n" + time="2024-05-15T13:40:08Z" level=debug msg=" Current MIG mode: Enabled" + time="2024-05-15T13:40:08Z" level=debug msg="Checking current MIG device configuration..." + time="2024-05-15T13:40:08Z" level=debug msg="Walking MigConfig for (devices=all)" + time="2024-05-15T13:40:08Z" level=debug msg=" GPU 0: 0x233010DE" + time="2024-05-15T13:40:08Z" level=debug msg=" Asserting MIG config: map[1g.10gb:5 2g.20gb:1]" + time="2024-05-15T13:40:08Z" level=debug msg="Running pre-apply-config hook" + time="2024-05-15T13:40:08Z" level=debug msg="Applying MIG device configuration..." 
+ time="2024-05-15T13:40:08Z" level=debug msg="Walking MigConfig for (devices=all)" + time="2024-05-15T13:40:08Z" level=debug msg=" GPU 0: 0x233010DE" + time="2024-05-15T13:40:08Z" level=debug msg=" MIG capable: true\n" + time="2024-05-15T13:40:08Z" level=debug msg=" Updating MIG config: map[1g.10gb:5 2g.20gb:1]" + time="2024-05-15T13:40:09Z" level=debug msg="Running apply-exit hook" + MIG configuration applied successfully + ``` + +## Example: Custom MIG Configuration + +You can create and apply a ConfigMap yourself if the default profiles do not meet your needs. + +1. Create a file, such as `custom-mig-config.yaml`, with contents like the following example: + + ```yaml + apiVersion: v1 + kind: ConfigMap + metadata: + name: custom-mig-config + data: + config.yaml: | + version: v1 + mig-configs: + all-disabled: + - devices: all + mig-enabled: false + + five-1g-one-2g: + - devices: all + mig-enabled: true + mig-devices: + "1g.10gb": 5 + "2g.20gb": 1 + ``` + +> [!NOTE] +> Custom ConfigMaps must contain a key named "config.yaml" + +1. Apply the manifest: + + ```console + $ kubectl apply -n gpu-operator -f custom-mig-config.yaml + ``` + +1. If the custom configuration specifies more than one instance profile, set the strategy to `mixed`: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ + --type='json' \ + -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"mixed"}]' + ``` + +1. Patch the cluster policy so MIG Manager uses the custom ConfigMap: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ + --type='json' \ + -p='[{"op":"replace", "path":"/spec/migManager/config/name", "value":"custom-mig-config"}]' + ``` + +1. 
Label the nodes with the profile to configure: + + ```console + $ kubectl label nodes nvidia.com/mig.config=five-1g-one-2g --overwrite + ``` + +## Verification: Running Sample CUDA Workloads + +After configuring a MIG profile and confirming `nvidia.com/mig.config.state: success`, +deploy a sample CUDA workload that requests a MIG resource to confirm scheduling. +Use the `gpu-operator-install` skill's verification workload (CUDA VectorAdd) as a basis, +requesting the appropriate MIG resource (for example `nvidia.com/mig-1g.10gb`). + +## Disabling MIG + +You can disable MIG on a node by setting the `nvidia.com/mig.config` label to `all-disabled`: + +```console +$ kubectl label nodes nvidia.com/mig.config=all-disabled --overwrite +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-multiinstance/references/preinstalled-drivers.md b/gpu-operator/.agents/skills/gpu-operator-multiinstance/references/preinstalled-drivers.md new file mode 100644 index 000000000..a2fb2cbaf --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-multiinstance/references/preinstalled-drivers.md @@ -0,0 +1,72 @@ + + + +# MIG Manager with Preinstalled Drivers + +MIG Manager supports preinstalled drivers. +Information in the preceding sections still applies, however there are a few additional details to consider. + +Throughout, replace `` with your target GPU Operator release. + +## Install + +During GPU Operator installation, `driver.enabled=false` must be set. The following options +can be used to install the GPU Operator: + +```console +$ helm install gpu-operator \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set driver.enabled=false +``` + +## Managing Host GPU Clients + +MIG Manager stops all operator-managed pods that have access to GPUs when applying a MIG reconfiguration. +When drivers are preinstalled, there can be GPU clients on the host that also need to be stopped. 
+ +When drivers are preinstalled, MIG Manager attempts to stop and restart a list of systemd services on the host across a MIG reconfiguration. +The list of services is specified in the `default-gpu-clients` ConfigMap. + +The following sample GPU clients file, `clients.yaml`, is used to create the `default-gpu-clients` ConfigMap: + +```yaml +version: v1 +systemd-services: + - nvsm.service + - nvsm-mqtt.service + - nvsm-core.service + - nvsm-api-gateway.service + - nvsm-notifier.service + - nv_peer_mem.service + - nvidia-dcgm.service + - dcgm.service + - dcgm-exporter.service +``` + +You can modify the list by editing the ConfigMap after installation. +Alternatively, you can create a custom ConfigMap for use by MIG Manager by performing the following steps: + +1. Create the `gpu-operator` namespace: + + ```console + $ kubectl create namespace gpu-operator + ``` + +1. Create a `ConfigMap` containing the custom `clients.yaml` file with a list of GPU clients: + + ```console + $ kubectl create configmap -n gpu-operator gpu-clients --from-file=clients.yaml + ``` + +1. Install the GPU Operator: + + ```console + $ helm install gpu-operator \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set migManager.gpuClientsConfig.name=gpu-clients \ + --set driver.enabled=false + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md index e4a1c0f59..f206095ba 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/SKILL.md @@ -20,432 +20,41 @@ tags: # NVIDIA GPU Driver Custom Resource Definition +Configure NVIDIA GPU Driver (`NVIDIADriver`) custom resources to manage the +driver type and version per node, including mixed driver types, mixed versions, +and mixed operating systems within a single cluster. 
+ ## Prerequisites - A running Kubernetes cluster with NVIDIA GPU worker nodes. - The NVIDIA GPU Operator installed with the driver custom resource enabled (`--set driver.nvidiaDriverCRD.enabled=true`). Use the `gpu-operator-install` skill to install the Operator. - This feature is recommended for new cluster installations only. You cannot use ClusterPolicy-managed drivers and the `NVIDIADriver` custom resource at the same time. -## Overview of the GPU Driver Custom Resource Definition - -You can create one or more instances of an NVIDIA driver (`NVIDIADriver`) custom resource -to specify the NVIDIA GPU driver type and driver version to configure on specific nodes. -You can specify labels in the node selector field to control which NVIDIA driver configuration is applied to specific nodes. - -### Limitations - -* This feature is recommended for new cluster installations only. - Upgrades from ClusterPolicy managed drivers to NVIDIA driver custom resource managed drivers are not supported. - Switching from ClusterPolicy to the NVIDIA driver custom resource will cause all existing driver pods to be terminated immediately and redeployed using the new NVIDIADriver configuration. -* You must either use the default NVIDIA driver custom resource that the Helm chart creates or create and manage your own custom NVIDIA driver custom resource. -* You can't use ClusterPolicy and the NVIDIA driver custom resource at the same time. You can only use one or the other in a cluster. - -### Comparison: Managing the Driver with CRD versus the Cluster Policy - -Before the introduction of the NVIDIA GPU Driver custom resource definition, you managed the driver by modifying -the driver field and subfields of the cluster policy custom resource definition. - -The key differences between the two approaches are summarized in the following table. - -| Cluster Policy CRD | NVIDIA Driver CRD * - | Supports a single driver type and version on all nodes. 
| Does not support multiple operating system versions. This limitation complicates performing an operating system upgrade on your nodes. - | Supports multiple driver types and versions on different nodes. | Supports multiple operating system versions on nodes. | -| --- | --- | --- | --- | --- | --- | -### Driver Daemon Sets - -The NVIDIA GPU Operator starts a driver daemon set for each NVIDIA driver custom resource and each operating system version. - -For example, if your cluster has one NVIDIA driver custom resource that specifies a 580 branch GPU driver and some -worker nodes run Ubuntu 20.04 and other worker nodes run Ubuntu 22.04, the Operator starts two driver daemon sets. -One daemon set configures the GPU driver on the Ubuntu 20.04 nodes and the other configures the driver on the Ubuntu 22.04 nodes. -All the nodes run the same 580 branch GPU driver. - -![](graphics/nvd-basics.svg) -If you choose to use precompiled driver containers, the Operator starts a driver daemon set for each Linux kernel version. - -For example, if some nodes run Ubuntu 22.04 and the 5.15.0-84-generic kernel, and other nodes run the 5.15.0-78-generic kernel, -then the Operator starts two daemon sets. - -### About the Default NVIDIA Driver Custom Resource - -By default, the Helm chart configures a default NVIDIA driver custom resource during installation. -This custom resource does not include a node selector and as a result, the custom resource applies to every node in your cluster -that has an NVIDIA GPU. -The Operator starts a driver daemon set and pods for each operating system version in your cluster. - -If you plan to configure your own driver custom resources to specify driver versions, types, and so on, then -you might prefer to avoid installing the default custom resource. -By preventing the installation, you can avoid node selector conflicts due to the default custom resource -matching all nodes and your custom resources matching some of the same nodes. 
- -To prevent configuring the default custom resource, specify the `--set driver.nvidiaDriverCRD.deployDefaultCR=false` -argument when you install the Operator with Helm. - -If the Operator is already installed with the default custom resource and you want to create your own -driver custom resources and apply them to specific nodes, delete the default custom resource. - -> [!NOTE] -> After you delete the default custom resource, your custom resources might not reconcile -> automatically due to a known issue. Refer to the v26.3.0 known issues -> for the workaround. -### Feature Compatibility - -Driver type - Each NVIDIA driver custom resource specifies the driver type and is one of `gpu`, `vgpu`, or `vgpu-host-manager`. - You can run the data-center driver (`gpu`) on some nodes and the vGPU driver on other nodes. - -GPUDirect RDMA and GPUDirect Storage - Each NVIDIA driver custom resource can specify how to configure GPUDirect RDMA and GPUDirect Storage (GDS). - Refer to GPUDirect RDMA and GPUDirect Storage for the platform support and prerequisites. - -GDRCopy - Each NVIDIA driver custom resource can enable the GDRCopy sidecar container in the driver pod. - -Precompiled and signed drivers - You can run the default driver type that is compiled when the driver pod starts on some nodes - and precompiled driver containers on other nodes. - The precomp-limitations-restrictions for precompiled driver containers apply. - -Preinstalled drivers on nodes - If a node has an NVIDIA GPU driver installed in the operating system, then no driver container runs on the node. - -Support for X86_64 and ARM64 - Each daemon set can run pods and driver containers for the X86_64 and ARM64 architectures. - Refer to the [NVIDIA GPU Driver tags](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags) - web page to determine which driver version and operating system combinations support both architectures. 
- -Custom Driver Parameters - Each NVIDIA driver custom resource can specify custom kernel module parameters by using a ConfigMap. - For more information, refer to Customizing NVIDIA GPU Driver Parameters during Installation (use the `gpu-operator-custom-driver` skill). - -## About the NVIDIA Driver Custom Resource - -An instance of the NVIDIA driver custom resource represents a specific NVIDIA GPU driver type and driver version to install and manage -on nodes. - -The following table describes some of the fields in the custom resource. - -| Field | Description | Default Value | | | | -| --- | --- | --- | --- | --- | --- | -| `metadata.name` | Specifies the name of the NVIDIA driver custom resource. | None | | | | -| `annotations` | Specifies a map of key and value pairs to add as custom annotations to the driver pod. | None | | | | -| `driverType` | Specifies one of the following: | `gpu` to use the NVIDIA data-center GPU driver. | `vgpu` to use the NVIDIA vGPU guest driver. | `vgpu-host-manager` to use the NVIDIA vGPU Manager. | `gpu` | -| `env` | Specifies environment variables to pass to the driver container. | None | | | | -| `gdrcopy.enabled` | Specifies whether to deploy the GDRCopy Driver. When set to `true` the GDRCopy Driver image runs as a sidecar container. | `false` | | | | -| `gds.enabled` | Specifies whether to enable GPUDirect Storage. | `false` | | | | -| `image` | Specifies the driver container image name. | `driver` | | | | -| `imagePullPolicy` | Specifies the policy for kubelet to download the container image. Refer to the Kubernetes documentation for [image pull policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy). | Refer to the Kubernetes documentation. | | | | -| `imagePullSecrets` | Specifies the credentials to provide to the registry if the registry is secured. | None | | | | -| `kernelModuleType` | Specifies the type of the NVIDIA GPU Kernel modules to use. 
Valid values are `auto` (default), `proprietary`, and `open`. `Auto` means that the recommended kernel module type is chosen based on the GPU devices on the host and the driver branch used. | `auto` | | | | -| `labels` | Specifies a map of key and value pairs to add as custom labels to the driver pod. | None | | | | -| `nodeSelector` | Specifies one or more node labels to match. The driver container is scheduled to nodes that match all the labels. | None. When you do not specify this field, the driver custom resource selects all nodes. | | | | -| `priorityClassName` | Specifies the priority class for the driver pod. | `system-node-critical` | | | | -| `rdma.enabled` | Specifies whether to enable GPUDirect RDMA. | `false` | | | | -| `repository` | Specifies the container registry that contains the driver container. | `nvcr.io/nvidia` | | | | -| `useOpenKernelModules` Deprecated. | This field is deprecated as of v25.3.0 and will be ignored. Use `kernelModuleType` instead. Specifies to use the NVIDIA Open GPU Kernel modules. | `false` | | | | -| `tolerations` | Specifies a set of tolerations to apply to the driver pod. | None | | | | -| `usePrecompiled` | When set to `true`, the Operator deploys a driver container image with a precompiled driver. | `false` | | | | -| `version` | Specifies the GPU driver version to install. For a data-center driver, specify a value like `580.126.20`. If you set `usePrecompiled` to `true`, specify the driver branch, such as `580`. | Refer to the operator-component-matrix. | | | | - -## Installing the NVIDIA GPU Operator - -Perform the following steps to install the GPU Operator and use the NVIDIA driver custom resources. - -1. 
Optional: If you want to run more than one driver type or version in the cluster, - label the worker nodes to identify the driver type and version to install on each node: - - *Example* - - ```console - $ kubectl label node --overwrite driver.version=580.126.20 - ``` - - - To use a mix of driver types, such as vGPU, label nodes for the driver type. - - To use a mix of driver versions, label the nodes for the different versions. - - To use a mix of conventional drivers and precompiled driver containers, label the nodes for the different types. - -1. Install the Operator. - - - Add the NVIDIA Helm repository: - - ```console - $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ - && helm repo update - ``` - -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). - - - Install the Operator and specify at least the `--set driver.nvidiaDriverCRD.enabled=true` argument: - - ```console - $ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set driver.nvidiaDriverCRD.enabled=true - ``` - - By default, Helm configures a `default` NVIDIA driver custom resource during installation. - To prevent configuring the default custom resource, also specify `--set driver.nvidiaDriverCRD.deployDefaultCR=false`. - -1. Apply NVIDIA driver custom resources manifests to install the NVIDIA GPU driver version, type, and so on for your nodes. - Refer to the sample manifests. - -## Sample NVIDIA Driver Manifests - -### One Driver Type and Version on All Nodes - -1. Optional: Remove previously applied node labels. - -1. Create a file, such as `nvd-all.yaml`, with contents like the following: +## Activation - ```yaml - apiVersion: nvidia.com/v1alpha1 - kind: NVIDIADriver - metadata: - name: nvidiadriver-sample - spec: - # use pre-compiled packages for NVIDIA driver installation. 
- usePrecompiled: false - driverType: gpu - repository: nvcr.io/nvidia - image: driver - version: "580.126.20" - imagePullPolicy: IfNotPresent - imagePullSecrets: [] - nodeSelector: {} - manager: {} - rdma: - enabled: false - useHostMofed: false - gds: - enabled: false - # Private mirror repository configuration - repoConfig: - name: "" - # custom ssl key/certificate configuration - certConfig: - name: "" - # vGPU licensing configuration - licensingConfig: - secretName: "" - nlsEnabled: true - # vGPU topology daemon configuration - virtualTopologyConfig: - name: "" - # kernel module configuration for NVIDIA driver - kernelModuleConfig: - name: "" - ``` +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. The procedural detail (commands, manifest contents, field tables) +lives only in those reference files — read the relevant one rather than +improvising from this dispatch layer. -1. Apply the manifest: +## Phases - ```console - $ kubectl apply -n gpu-operator -f nvd-all.yaml - ``` +| Phase | Summary | Reference | +|-------|---------|-----------| +| Concepts | What the `NVIDIADriver` CRD is, its limitations, CRD-vs-ClusterPolicy comparison, driver daemon sets, default custom resource, feature compatibility, and the full field reference table. | [references/concepts.md](references/concepts.md) | +| Install | Install the GPU Operator with the driver CRD enabled, including optional node labeling and the Helm repository/install commands. | [references/install.md](references/install.md) | +| Manifests | Sample `NVIDIADriver` manifests: one type/version on all nodes, multiple versions, precompiled on all nodes, and precompiled on some nodes. | [references/manifests.md](references/manifests.md) | +| Upgrade & verify | Patch the driver version (rolling update) and verify that custom resources are applied and driver pods are running. 
| [references/upgrade-and-verify.md](references/upgrade-and-verify.md) | -1. Optional: Monitor the progress: +## Hard rules (apply across all phases) - ```console - $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' - ``` - -### Multiple Driver Versions - -1. Label the nodes. - - - On some nodes, apply a label like the following: - - ```console - $ kubectl label node --overwrite driver.config="gold" - ``` - - - On other nodes, apply a label like the following: - - ```console - $ kubectl label node --overwrite driver.config="silver" - ``` - -1. Create a file, such as `nvd-driver-multiple.yaml`, with contents like the following: - - ```yaml - apiVersion: nvidia.com/v1alpha1 - kind: NVIDIADriver - metadata: - name: demo-gold - spec: - driverType: gpu - env: [] - image: driver - imagePullPolicy: IfNotPresent - imagePullSecrets: [] - manager: {} - nodeSelector: - driver.config: "gold" - repository: nvcr.io/nvidia - version: "580.126.20" - --- - apiVersion: nvidia.com/v1alpha1 - kind: NVIDIADriver - metadata: - name: demo-silver - spec: - driverType: gpu - env: [] - image: driver - imagePullPolicy: IfNotPresent - imagePullSecrets: [] - manager: {} - nodeSelector: - driver.config: "silver" - repository: nvcr.io/nvidia - version: "470.141.10" - ``` - -1. Apply the manifest: - - ```console - $ kubectl apply -n gpu-operator -f nvd-driver-multiple.yaml - ``` - -1. Optional: Monitor the progress: - - ```console - $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' - ``` - -### One Precompiled Driver Container on All Nodes - -1. Optional: Remove previously applied node labels. - -1. 
Create a file, such as `nvd-precompiled-all.yaml`, with contents like the following: - - ```yaml - apiVersion: nvidia.com/v1alpha1 - kind: NVIDIADriver - metadata: - name: demo-precomp-all - spec: - driverType: gpu - env: [] - image: driver - imagePullPolicy: IfNotPresent - imagePullSecrets: [] - manager: {} - nodeSelector: {} - repository: nvcr.io/nvidia - resources: {} - usePrecompiled: true - version: "580" - ``` - - > [!TIP] - > Because the manifest does not include a `nodeSelector` field, the driver custom - > resource selects all nodes in the cluster that have an NVIDIA GPU. - - 1. Apply the manifest: - - ```console - $ kubectl apply -n gpu-operator -f nvd-precompiled-all.yaml - ``` - -1. Optional: Monitor the progress: - - ```console - $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' - ``` - -### Precompiled Driver Container on Some Nodes - -1. Label the nodes like the following sample: - - ```console - $ kubectl label node --overwrite driver.precompiled="true" - $ kubectl label node --overwrite driver.version="580" - ``` - -1. Create a file, such as `nvd-precompiled-some.yaml`, with contents like the following: - - ```yaml - apiVersion: nvidia.com/v1alpha1 - kind: NVIDIADriver - metadata: - name: demo-precomp - spec: - driverType: gpu - env: [] - image: driver - imagePullPolicy: IfNotPresent - imagePullSecrets: [] - manager: {} - nodeSelector: - driver.precompiled: "true" - driver.version: "580" - repository: nvcr.io/nvidia - resources: {} - usePrecompiled: true - version: "580" - ``` - -1. Apply the manifest: - - ```console - $ kubectl apply -n gpu-operator -f nvd-precompiled-some.yaml - ``` - -1. Optional: Monitor the progress: - - ```console - $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' - ``` - -## Upgrading the NVIDIA GPU Driver - -You can upgrade the driver version by editing or patching the NVIDIA driver custom resource. 
- -When you update the custom resource, the Operator performs a rolling update of the pods in the affected daemon set. - -1. Update the `driver.version` field in the driver custom resource: - - ```console - $ kubectl patch nvidiadriver/demo-silver --type='json' \ - -p='[{"op": "replace", "path": "/spec/version", "value": "525.125.06"}]' - ``` - -1. Optional: Monitor the progress: - - ```console - $ kubectl get pods -n gpu-operator -l app.kubernetes.io/component=nvidia-driver - ``` - - *Example Output* - - ```output - NAME READY STATUS RESTARTS AGE - nvidia-gpu-driver-ubuntu20.04-788484b9bb-6zhd9 1/1 Running 0 5m1s - nvidia-gpu-driver-ubuntu22.04-8896c4bf7-7s68q 1/1 Terminating 0 37m - nvidia-gpu-driver-ubuntu22.04-8896c4bf7-jm74l 1/1 Running 0 37m - ``` - -Eventually, the Operator replaces the pods that used the previous driver version with pods that use the updated driver version. +- Never use ClusterPolicy-managed drivers and the `NVIDIADriver` custom resource at the same time — choose one per cluster. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a specific version. +- This feature is recommended for new cluster installations only; upgrades from ClusterPolicy-managed drivers are not supported. ## Verification -Confirm that the driver custom resources are applied and the driver pods are running: - -1. List the `NVIDIADriver` custom resources and confirm their state: - - ```console - $ kubectl get nvidiadrivers - ``` - -1. Confirm the driver pods are running on the expected nodes: - - ```console - $ kubectl get pods -n gpu-operator -l app.kubernetes.io/component=nvidia-driver -o wide - ``` - - Each driver pod should report `Running`. If a pod is not progressing, inspect the events: - - ```console - $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' - ``` +After applying driver custom resources, confirm they are reconciled and the +driver pods are `Running`. 
The exact commands are in +[references/upgrade-and-verify.md](references/upgrade-and-verify.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/concepts.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/concepts.md new file mode 100644 index 000000000..b27b74c11 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/concepts.md @@ -0,0 +1,125 @@ + + + +# NVIDIA Driver Custom Resource: Concepts and Fields + +## Overview of the GPU Driver Custom Resource Definition + +You can create one or more instances of an NVIDIA driver (`NVIDIADriver`) custom resource +to specify the NVIDIA GPU driver type and driver version to configure on specific nodes. +You can specify labels in the node selector field to control which NVIDIA driver configuration is applied to specific nodes. + +### Limitations + +* This feature is recommended for new cluster installations only. + Upgrades from ClusterPolicy managed drivers to NVIDIA driver custom resource managed drivers are not supported. + Switching from ClusterPolicy to the NVIDIA driver custom resource will cause all existing driver pods to be terminated immediately and redeployed using the new NVIDIADriver configuration. +* You must either use the default NVIDIA driver custom resource that the Helm chart creates or create and manage your own custom NVIDIA driver custom resource. +* You can't use ClusterPolicy and the NVIDIA driver custom resource at the same time. You can only use one or the other in a cluster. + +### Comparison: Managing the Driver with CRD versus the Cluster Policy + +Before the introduction of the NVIDIA GPU Driver custom resource definition, you managed the driver by modifying +the driver field and subfields of the cluster policy custom resource definition. + +The key differences between the two approaches are summarized in the following table. 
+ +| Cluster Policy CRD | NVIDIA Driver CRD | +| --- | --- | +| Supports a single driver type and version on all nodes. Does not support multiple operating system versions. This limitation complicates performing an operating system upgrade on your nodes. | Supports multiple driver types and versions on different nodes. Supports multiple operating system versions on nodes. | + +### Driver Daemon Sets + +The NVIDIA GPU Operator starts a driver daemon set for each NVIDIA driver custom resource and each operating system version. + +For example, if your cluster has one NVIDIA driver custom resource that specifies a 580 branch GPU driver and some +worker nodes run Ubuntu 20.04 and other worker nodes run Ubuntu 22.04, the Operator starts two driver daemon sets. +One daemon set configures the GPU driver on the Ubuntu 20.04 nodes and the other configures the driver on the Ubuntu 22.04 nodes. +All the nodes run the same 580 branch GPU driver. + +![](../graphics/nvd-basics.svg) +If you choose to use precompiled driver containers, the Operator starts a driver daemon set for each Linux kernel version. + +For example, if some nodes run Ubuntu 22.04 and the 5.15.0-84-generic kernel, and other nodes run the 5.15.0-78-generic kernel, +then the Operator starts two daemon sets. + +### About the Default NVIDIA Driver Custom Resource + +By default, the Helm chart configures a default NVIDIA driver custom resource during installation. +This custom resource does not include a node selector and as a result, the custom resource applies to every node in your cluster +that has an NVIDIA GPU. +The Operator starts a driver daemon set and pods for each operating system version in your cluster. + +If you plan to configure your own driver custom resources to specify driver versions, types, and so on, then +you might prefer to avoid installing the default custom resource.
+By preventing the installation, you can avoid node selector conflicts due to the default custom resource +matching all nodes and your custom resources matching some of the same nodes. + +To prevent configuring the default custom resource, specify the `--set driver.nvidiaDriverCRD.deployDefaultCR=false` +argument when you install the Operator with Helm. + +If the Operator is already installed with the default custom resource and you want to create your own +driver custom resources and apply them to specific nodes, delete the default custom resource. + +> [!NOTE] +> After you delete the default custom resource, your custom resources might not reconcile +> automatically due to a known issue. Refer to the v26.3.0 known issues +> for the workaround. + +### Feature Compatibility + +Driver type + Each NVIDIA driver custom resource specifies the driver type and is one of `gpu`, `vgpu`, or `vgpu-host-manager`. + You can run the data-center driver (`gpu`) on some nodes and the vGPU driver on other nodes. + +GPUDirect RDMA and GPUDirect Storage + Each NVIDIA driver custom resource can specify how to configure GPUDirect RDMA and GPUDirect Storage (GDS). + Refer to GPUDirect RDMA and GPUDirect Storage for the platform support and prerequisites. + +GDRCopy + Each NVIDIA driver custom resource can enable the GDRCopy sidecar container in the driver pod. + +Precompiled and signed drivers + You can run the default driver type that is compiled when the driver pod starts on some nodes + and precompiled driver containers on other nodes. + The precomp-limitations-restrictions for precompiled driver containers apply. + +Preinstalled drivers on nodes + If a node has an NVIDIA GPU driver installed in the operating system, then no driver container runs on the node. + +Support for X86_64 and ARM64 + Each daemon set can run pods and driver containers for the X86_64 and ARM64 architectures. 
+ Refer to the [NVIDIA GPU Driver tags](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags) + web page to determine which driver version and operating system combinations support both architectures. + +Custom Driver Parameters + Each NVIDIA driver custom resource can specify custom kernel module parameters by using a ConfigMap. + For more information, refer to Customizing NVIDIA GPU Driver Parameters during Installation (use the `gpu-operator-custom-driver` skill). + +## About the NVIDIA Driver Custom Resource + +An instance of the NVIDIA driver custom resource represents a specific NVIDIA GPU driver type and driver version to install and manage +on nodes. + +The following table describes some of the fields in the custom resource. + +| Field | Description | Default Value | +| --- | --- | --- | +| `metadata.name` | Specifies the name of the NVIDIA driver custom resource. | None | +| `annotations` | Specifies a map of key and value pairs to add as custom annotations to the driver pod. | None | +| `driverType` | Specifies one of the following: `gpu` to use the NVIDIA data-center GPU driver, `vgpu` to use the NVIDIA vGPU guest driver, or `vgpu-host-manager` to use the NVIDIA vGPU Manager. | `gpu` | +| `env` | Specifies environment variables to pass to the driver container. | None | +| `gdrcopy.enabled` | Specifies whether to deploy the GDRCopy Driver. When set to `true`, the GDRCopy Driver image runs as a sidecar container. | `false` | +| `gds.enabled` | Specifies whether to enable GPUDirect Storage. | `false` | +| `image` | Specifies the driver container image name. | `driver` | +| `imagePullPolicy` | Specifies the policy for kubelet to download the container image. Refer to the Kubernetes documentation for [image pull policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy). | Refer to the Kubernetes documentation. |
+| `imagePullSecrets` | Specifies the credentials to provide to the registry if the registry is secured. | None | +| `kernelModuleType` | Specifies the type of the NVIDIA GPU Kernel modules to use. Valid values are `auto` (default), `proprietary`, and `open`. With `auto`, the recommended kernel module type is chosen based on the GPU devices on the host and the driver branch used. | `auto` | +| `labels` | Specifies a map of key and value pairs to add as custom labels to the driver pod. | None | +| `nodeSelector` | Specifies one or more node labels to match. The driver container is scheduled to nodes that match all the labels. | None. When you do not specify this field, the driver custom resource selects all nodes. | +| `priorityClassName` | Specifies the priority class for the driver pod. | `system-node-critical` | +| `rdma.enabled` | Specifies whether to enable GPUDirect RDMA. | `false` | +| `repository` | Specifies the container registry that contains the driver container. | `nvcr.io/nvidia` | +| `useOpenKernelModules` (deprecated) | This field is deprecated as of v25.3.0 and is ignored. Use `kernelModuleType` instead. It previously specified whether to use the NVIDIA Open GPU Kernel modules. | `false` | +| `tolerations` | Specifies a set of tolerations to apply to the driver pod. | None | +| `usePrecompiled` | When set to `true`, the Operator deploys a driver container image with a precompiled driver. | `false` | +| `version` | Specifies the GPU driver version to install. For a data-center driver, specify a value like `580.126.20`. If you set `usePrecompiled` to `true`, specify the driver branch, such as `580`. | Refer to the GPU Operator component matrix. |
diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/install.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/install.md new file mode 100644 index 000000000..4377a1426 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/install.md @@ -0,0 +1,47 @@ + + + +# Installing the GPU Operator with Driver Custom Resources + +Perform the following steps to install the GPU Operator and use the NVIDIA driver custom resources. + +1. Optional: If you want to run more than one driver type or version in the cluster, + label the worker nodes to identify the driver type and version to install on each node: + + *Example* + + ```console + $ kubectl label node --overwrite driver.version=580.126.20 + ``` + + - To use a mix of driver types, such as vGPU, label nodes for the driver type. + - To use a mix of driver versions, label the nodes for the different versions. + - To use a mix of conventional drivers and precompiled driver containers, label the nodes for the different types. + +1. Install the Operator. + + - Add the NVIDIA Helm repository: + + ```console + $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ + && helm repo update + ``` + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + + - Install the Operator and specify at least the `--set driver.nvidiaDriverCRD.enabled=true` argument: + + ```console + $ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set driver.nvidiaDriverCRD.enabled=true + ``` + + By default, Helm configures a `default` NVIDIA driver custom resource during installation. + To prevent configuring the default custom resource, also specify `--set driver.nvidiaDriverCRD.deployDefaultCR=false`. + +1. 
Apply NVIDIA driver custom resource manifests to configure the NVIDIA GPU driver type, version, and other settings for your nodes. + Refer to the sample manifests (see [references/manifests.md](manifests.md)). diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/manifests.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/manifests.md new file mode 100644 index 000000000..d9a8bb065 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/manifests.md @@ -0,0 +1,210 @@ + + + +# Sample NVIDIA Driver Manifests + +## One Driver Type and Version on All Nodes + +1. Optional: Remove previously applied node labels. + +1. Create a file, such as `nvd-all.yaml`, with contents like the following: + + ```yaml + apiVersion: nvidia.com/v1alpha1 + kind: NVIDIADriver + metadata: + name: nvidiadriver-sample + spec: + # use pre-compiled packages for NVIDIA driver installation. + usePrecompiled: false + driverType: gpu + repository: nvcr.io/nvidia + image: driver + version: "580.126.20" + imagePullPolicy: IfNotPresent + imagePullSecrets: [] + nodeSelector: {} + manager: {} + rdma: + enabled: false + useHostMofed: false + gds: + enabled: false + # Private mirror repository configuration + repoConfig: + name: "" + # custom ssl key/certificate configuration + certConfig: + name: "" + # vGPU licensing configuration + licensingConfig: + secretName: "" + nlsEnabled: true + # vGPU topology daemon configuration + virtualTopologyConfig: + name: "" + # kernel module configuration for NVIDIA driver + kernelModuleConfig: + name: "" + ``` + +1. Apply the manifest: + + ```console + $ kubectl apply -n gpu-operator -f nvd-all.yaml + ``` + +1. Optional: Monitor the progress: + + ```console + $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' + ``` + +## Multiple Driver Versions + +1. Label the nodes. 
+ + - On some nodes, apply a label like the following: + + ```console + $ kubectl label node --overwrite driver.config="gold" + ``` + + - On other nodes, apply a label like the following: + + ```console + $ kubectl label node --overwrite driver.config="silver" + ``` + +1. Create a file, such as `nvd-driver-multiple.yaml`, with contents like the following: + + ```yaml + apiVersion: nvidia.com/v1alpha1 + kind: NVIDIADriver + metadata: + name: demo-gold + spec: + driverType: gpu + env: [] + image: driver + imagePullPolicy: IfNotPresent + imagePullSecrets: [] + manager: {} + nodeSelector: + driver.config: "gold" + repository: nvcr.io/nvidia + version: "580.126.20" + --- + apiVersion: nvidia.com/v1alpha1 + kind: NVIDIADriver + metadata: + name: demo-silver + spec: + driverType: gpu + env: [] + image: driver + imagePullPolicy: IfNotPresent + imagePullSecrets: [] + manager: {} + nodeSelector: + driver.config: "silver" + repository: nvcr.io/nvidia + version: "470.141.10" + ``` + +1. Apply the manifest: + + ```console + $ kubectl apply -n gpu-operator -f nvd-driver-multiple.yaml + ``` + +1. Optional: Monitor the progress: + + ```console + $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' + ``` + +## One Precompiled Driver Container on All Nodes + +1. Optional: Remove previously applied node labels. + +1. Create a file, such as `nvd-precompiled-all.yaml`, with contents like the following: + + ```yaml + apiVersion: nvidia.com/v1alpha1 + kind: NVIDIADriver + metadata: + name: demo-precomp-all + spec: + driverType: gpu + env: [] + image: driver + imagePullPolicy: IfNotPresent + imagePullSecrets: [] + manager: {} + nodeSelector: {} + repository: nvcr.io/nvidia + resources: {} + usePrecompiled: true + version: "580" + ``` + + > [!TIP] + > Because the `nodeSelector` field in the manifest is empty, the driver custom + > resource selects all nodes in the cluster that have an NVIDIA GPU. + +1. 
Apply the manifest: + + ```console + $ kubectl apply -n gpu-operator -f nvd-precompiled-all.yaml + ``` + +1. Optional: Monitor the progress: + + ```console + $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' + ``` + +## Precompiled Driver Container on Some Nodes + +1. Label the nodes like the following sample: + + ```console + $ kubectl label node --overwrite driver.precompiled="true" + $ kubectl label node --overwrite driver.version="580" + ``` + +1. Create a file, such as `nvd-precompiled-some.yaml`, with contents like the following: + + ```yaml + apiVersion: nvidia.com/v1alpha1 + kind: NVIDIADriver + metadata: + name: demo-precomp + spec: + driverType: gpu + env: [] + image: driver + imagePullPolicy: IfNotPresent + imagePullSecrets: [] + manager: {} + nodeSelector: + driver.precompiled: "true" + driver.version: "580" + repository: nvcr.io/nvidia + resources: {} + usePrecompiled: true + version: "580" + ``` + +1. Apply the manifest: + + ```console + $ kubectl apply -n gpu-operator -f nvd-precompiled-some.yaml + ``` + +1. Optional: Monitor the progress: + + ```console + $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/upgrade-and-verify.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/upgrade-and-verify.md new file mode 100644 index 000000000..17cd3d4ac --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-driver/references/upgrade-and-verify.md @@ -0,0 +1,56 @@ + + + +# Upgrading and Verifying the NVIDIA GPU Driver + +## Upgrading the NVIDIA GPU Driver + +You can upgrade the driver version by editing or patching the NVIDIA driver custom resource. + +When you update the custom resource, the Operator performs a rolling update of the pods in the affected daemon set. + +1. 
Update the `version` field in the driver custom resource: + + ```console + $ kubectl patch nvidiadriver/demo-silver --type='json' \ + -p='[{"op": "replace", "path": "/spec/version", "value": "525.125.06"}]' + ``` + +1. Optional: Monitor the progress: + + ```console + $ kubectl get pods -n gpu-operator -l app.kubernetes.io/component=nvidia-driver + ``` + + *Example Output* + + ```output + NAME READY STATUS RESTARTS AGE + nvidia-gpu-driver-ubuntu20.04-788484b9bb-6zhd9 1/1 Running 0 5m1s + nvidia-gpu-driver-ubuntu22.04-8896c4bf7-7s68q 1/1 Terminating 0 37m + nvidia-gpu-driver-ubuntu22.04-8896c4bf7-jm74l 1/1 Running 0 37m + ``` + +Eventually, the Operator replaces the pods that used the previous driver version with pods that use the updated driver version. + +## Verification + +Confirm that the driver custom resources are applied and the driver pods are running: + +1. List the `NVIDIADriver` custom resources and confirm their state: + + ```console + $ kubectl get nvidiadrivers + ``` + +1. Confirm the driver pods are running on the expected nodes: + + ```console + $ kubectl get pods -n gpu-operator -l app.kubernetes.io/component=nvidia-driver -o wide + ``` + + Each driver pod should report `Running`. If a pod is not progressing, inspect the events: + + ```console + $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md index e7725c752..2f26348c9 100644 --- a/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/SKILL.md @@ -20,473 +20,43 @@ tags: # Time-Slicing GPUs in Kubernetes +Oversubscribe NVIDIA GPUs by defining time-sliced replicas with the GPU Operator +and NVIDIA Kubernetes Device Plugin, so multiple workloads share a GPU without +the hardware memory/fault isolation that MIG provides. 
+ ## Prerequisites - A running Kubernetes cluster with NVIDIA GPU worker nodes. - The NVIDIA GPU Operator installed (use the `gpu-operator-install` skill). - NVIDIA GPUs that support time-slicing. Time-slicing shares access to a GPU among workloads without memory or fault isolation; for hardware-isolated partitioning, use MIG (use the `gpu-operator-multiinstance` skill). -## Understanding Time-Slicing GPUs - -The NVIDIA GPU Operator enables oversubscription of GPUs through a set -of extended options for the [NVIDIA Kubernetes Device Plugin](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/k8s-device-plugin). -GPU time-slicing enables workloads that are scheduled on oversubscribed GPUs to -interleave with one another. - -This mechanism for enabling *time-slicing* of -GPUs in Kubernetes enables a system administrator to define a set of -*replicas* for a GPU, each of which can be handed out independently to a -pod to run workloads on. Unlike Multi-Instance GPU (MIG), there is no memory or -fault-isolation between replicas, but for some workloads this is better -than not being able to share at all. Internally, GPU -time-slicing is used to multiplex workloads from -replicas of the same underlying GPU. - -> [!NOTE] -> A typical resource request provides exclusive access to GPUs. -> A request for a time-sliced GPU provides shared access. -> A request for more than one time-sliced GPU does not guarantee that the pod -> receives access to a proportional amount of GPU compute power. - -A request for more than one time-sliced GPU only specifies that the pod -receives access to a GPU that is shared by other pods. -Each pod can run as many processes on the underlying GPU without a limit. -The GPU simply provides an equal share of time to all GPU processes, across -all of the pods. -You can apply a cluster-wide default time-slicing configuration. -You can also apply node-specific configurations. 
-For example, you can apply a time-slicing configuration to nodes with Tesla-T4 GPUs only -and not modify nodes with other GPU models. - -You can combine the two approaches by applying a cluster-wide default configuration -and then label nodes so that those nodes receive a node-specific configuration. - -### Comparison: Time-Slicing and Multi-Instance GPU - -The latest generations of NVIDIA GPUs provide an operation mode called -Multi-Instance GPU (MIG). MIG allows you to partition a GPU -into several smaller, predefined instances, each of which looks like a -mini-GPU that provides memory and fault isolation at the hardware layer. -You can share access to a GPU by running workloads on one of -these predefined instances instead of the full native GPU. - -MIG support was added to Kubernetes in 2020. Refer to [Supporting MIG in Kubernetes](https://www.google.com/url?q=https://docs.google.com/document/d/1mdgMQ8g7WmaI_XVVRrCvHPFPOMCm5LQD5JefgAh6N8g/edit&sa=D&source=editors&ust=1655578433019961&usg=AOvVaw1F-OezvM-Svwr1lLsdQmu3) -for details on how this works. - -Time-slicing trades the memory and fault-isolation that is provided by MIG -for the ability to share a GPU by a larger number of users. -Time-slicing also provides a way to provide shared access to a GPU for -older generation GPUs that do not support MIG. -However, you can combine MIG and time-slicing to provide shared access to -MIG instances. - -### Support Platforms and Resource Types - -GPU time-slicing can be used with bare-metal applications, virtual machines -with GPU passthrough, and virtual machines with NVIDIA vGPU. - -Currently, the only supported resource types are `nvidia.com/gpu` -and any of the resource types that emerge from configuring a node with -the mixed MIG strategy. - -### Limitations - -- DCGM-Exporter does not support associating metrics to containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin. 
-- The Operator does not monitor changes to a time-slicing config map. - Refer to the **Updating a Time-Slicing Config Map** section. - -### Changes to Node Labels - -In addition to the standard node labels that GPU Feature Discovery (GFD) -applies to nodes, the following label is also applied after you configure -GPU time-slicing for a node: - -```yaml -nvidia.com/.replicas = -``` - -Where `` is the factor by which each resource of `` is oversubscribed. - -Additionally, by default, the `nvidia.com/.product` label is modified: - -```yaml -nvidia.com/.product = -SHARED -``` - -For example, on an NVIDIA DGX A100 machine, depending on the time-slicing configuration, -the labels can be similar to the following example: - -```yaml -nvidia.com/gpu.replicas = 8 -nvidia.com/gpu.product = A100-SXM4-40GB-SHARED -``` - -Using these labels, you can request time-sliced access to a GPU or exclusive access to a GPU -in the same way that you traditionally specify a node selector to request one GPU model over another. -That is, the `-SHARED` product name suffix ensures that you can specify a -node selector to assign pods to nodes with time-sliced GPUs. - -The `migStrategy` configuration option has an effect on the node label for the product name. -When `renameByDefault=false`, the default value, and `migStrategy=single`, both the MIG profile name -and the `-SHARED` suffix are appended to the product name, such as the following example: - -```yaml -nvidia.com/gpu.product = A100-SXM4-40GB-MIG-1g.5gb-SHARED -``` - -If you set `renameByDefault=true`, then the value of the `nvidia.com/gpu.product` node -label is not modified. - -## Configuration - -### About Configuring GPU Time-Slicing - -You configure GPU time-slicing by performing the following high-level steps: - -* Add a config map to the namespace that is used by the GPU operator. -* Configure the cluster policy so that the device plugin uses the config map. 
-* Apply a label to the nodes that you want to configure for GPU time-slicing. - -On a machine with one GPU, the following config map configures Kubernetes so that -the node advertises four GPU resources. -A machine with two GPUs advertises eight GPUs, and so on. - -### Sample Config Map - -The following table describes the key fields in the config map. - -| Field | Type | Description | -| --- | --- | --- | -| `data.` | string | Specifies the time-slicing configuration name. You can specify multiple configurations if you want to assign node-specific configurations. In the preceding example, the value for `key` is `any`. | -| `flags.migStrategy` | string | Specifies how to label MIG devices for the nodes that receive the time-slicing configuration. Specify one of `none`, `single`, or `mixed`. The default value is `none`. | -| `renameByDefault` | boolean | When set to `true`, each resource is advertised under the name `.shared` instead of ``. For example, if this field is set to `true` and the resource is typically `nvidia.com/gpu`, the nodes that are configured for time-sliced GPU access then advertise the resource as `nvidia.com/gpu.shared`. Setting this field to true can be helpful if you want to schedule pods on GPUs with shared access by specifying `.shared` in the resource request. When this field is set to `false`, the advertised resource name, such as `nvidia.com/gpu`, is not modified. However, label for the product name is suffixed with `-SHARED`. For example, if the output of `kubectl describe node` shows the node label `nvidia.com/gpu.product=Tesla-T4`, then after the node is configured for time-sliced GPU access, the label becomes `nvidia.com/gpu.product=Tesla-T4-SHARED`. In this case, you can specify a node selector that includes the `-SHARED` suffix to schedule pods on GPUs with shared access. The default value is `false`. 
| -| `failRequestsGreaterThanOne` | boolean | The purpose of this field is to enforce awareness that requesting more than one GPU replica does not result in receiving more proportional access to the GPU. For example, if `4` GPU replicas are available and two pods request `1` GPU each and a third pod requests `2` GPUs, the applications in the three pods have an equal share of GPU compute time. Specifically, the pod that requests `2` GPUs does not receive twice as much compute time as the pods that request `1` GPU. When set to `true`, a resource request for more than one GPU fails with an `UnexpectedAdmissionError`. In this case, you must manually delete the pod, update the resource request, and redeploy. | -| `resources.name` | string | Specifies the resource type to make available with time-sliced access, such as `nvidia.com/gpu`, `nvidia.com/mig-1g.5gb`, and so on. | -| `resources.replicas` | integer | Specifies the number of time-sliced GPU replicas to make available for shared access to GPUs of the specified resource type. | -### Applying One Cluster-Wide Configuration - -Perform the following steps to configure GPU time-slicing if you already installed the GPU operator -and want to apply the same time-slicing configuration on all nodes in the cluster. - -1. Create a file, such as `time-slicing-config-all.yaml`, with contents like the following example: - - ```yaml - apiVersion: v1 - kind: ConfigMap - metadata: - name: time-slicing-config-all - data: - any: |- - version: v1 - flags: - migStrategy: none - sharing: - timeSlicing: - resources: - - name: nvidia.com/gpu - replicas: 4 - ``` - -1. Add the config map to the same namespace as the GPU operator: - - ```console - $ kubectl create -n gpu-operator -f time-slicing-config-all.yaml - ``` - -1. 
Configure the device plugin with the config map and set the default time-slicing configuration: - - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ - -n gpu-operator --type merge \ - -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all", "default": "any"}}}}' - ``` - -1. Optional: Confirm that the `gpu-feature-discovery` and - `nvidia-device-plugin-daemonset` pods restart. - - ```console - $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' - ``` - - *Example Output* - -Refer to time-slicing-verify. - -### Applying Multiple Node-Specific Configurations - -An alternative to applying one cluster-wide configuration is to specify multiple -time-slicing configurations in the config map and to apply labels node-by-node to -control which configuration is applied to which nodes. - -1. Create a file, such as `time-slicing-config-fine.yaml`, with contents like the following example: - - ```yaml - apiVersion: v1 - kind: ConfigMap - metadata: - name: time-slicing-config-fine - data: - a100-40gb: |- - version: v1 - flags: - migStrategy: mixed - sharing: - timeSlicing: - resources: - - name: nvidia.com/gpu - replicas: 8 - - name: nvidia.com/mig-1g.5gb - replicas: 2 - - name: nvidia.com/mig-2g.10gb - replicas: 2 - - name: nvidia.com/mig-3g.20gb - replicas: 3 - - name: nvidia.com/mig-7g.40gb - replicas: 7 - tesla-t4: |- - version: v1 - flags: - migStrategy: none - sharing: - timeSlicing: - resources: - - name: nvidia.com/gpu - replicas: 4 - ``` - -1. Add the config map to the same namespace as the GPU operator: - - ```console - $ kubectl create -n gpu-operator -f time-slicing-config-fine.yaml - ``` - -1. 
Configure the device plugin with the config map and set the default time-slicing configuration: - - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ - -n gpu-operator --type merge \ - -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-fine"}}}}' - ``` - - Because the specification does not include the `devicePlugin.config.default` field, - when the device plugin pods redeploy, they do not automatically apply the time-slicing - configuration to all nodes. - -1. Optional: Confirm that the `gpu-feature-discovery` and - `nvidia-device-plugin-daemonset` pods restart. - - ```console - $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' - ``` - - *Example Output* - -1. Apply a label to the nodes by running one or more of the following commands: - - * Apply a label to nodes one-by-one by specifying the node name: - - ```console - $ kubectl label node nvidia.com/device-plugin.config=tesla-t4 - ``` - - * Apply a label to several nodes at one time by specifying a label selector: - - ```console - $ kubectl label node \ - --selector=nvidia.com/gpu.product=Tesla-T4 \ - nvidia.com/device-plugin.config=tesla-t4 - ``` - -Refer to time-slicing-verify. - -### Configuring Time-Slicing Before Installing the NVIDIA GPU Operator - -You can enable time-slicing with the NVIDIA GPU Operator by passing the -`devicePlugin.config.name=` parameter during installation. - -Perform the following steps to configure time-slicing before installing the operator: - -1. Create the namespace for the operator: - - ```console - $ kubectl create namespace gpu-operator - ``` - -1. Create a file, such as `time-slicing-config.yaml`, with the config map contents. - - Refer to the **Applying One Cluster-Wide Configuration** or **Applying Multiple Node-Specific Configurations** sections. - -1. 
Add the config map to the same namespace as the GPU operator: - - ```console - $ kubectl create -f time-slicing-config.yaml - ``` - -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). - -1. Install the operator with Helm: - - ```console - $ helm install gpu-operator nvidia/gpu-operator \ - -n gpu-operator \ - --version= \ - --set devicePlugin.config.name=time-slicing-config - ``` - -1. Refer to either the **Applying One Cluster-Wide Configuration** or **Applying Multiple Node-Specific Configurations** section and perform the following tasks: - - * Configure the device plugin by running the `kubectl patch` command. - * Apply labels to nodes if you added a config map with node-specific configurations. - -After installation, refer to time-slicing-verify. - -### Updating a Time-Slicing Config Map - -The Operator does not monitor the time-slicing config maps. -As a result, if you modify a config map, the device plugin pods do not restart and do not apply the modified configuration. - -To apply the modified config map, manually restart the device plugin pods: - -```console -$ kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset -``` - -Currently running workloads are not affected and continue to run, though NVIDIA recommends performing the restart during a maintenance period. - -## Verifying the GPU Time-Slicing Configuration - -Perform the following steps to verify that the time-slicing configuration is applied successfully: - -1. Confirm that the node advertises additional GPU resources: - - ```console - $ kubectl describe node - ``` - - *Example Output* - - The example output varies according to the GPU in your node and the configuration - that you apply. - - The following output applies when `renameByDefault` is set to `false`, - the default value. 
- The key considerations are as follows: - - * The `nvidia.com/gpu.count` label reports the number of physical GPUs in the machine. - * The `nvidia.com/gpu.product` label includes a `-SHARED` suffix to the product name. - * The `nvidia.com/gpu.replicas` label matches the reported capacity. - - ```output - ... - Labels: - nvidia.com/gpu.count=4 - nvidia.com/gpu.product=Tesla-T4-SHARED - nvidia.com/gpu.replicas=4 - Capacity: - nvidia.com/gpu: 16 - ... - Allocatable: - nvidia.com/gpu: 16 - ... - ``` - - The following output applies when `renameByDefault` is set to `true`. - The key considerations are as follows: - - * The `nvidia.com/gpu.count` label reports the number of physical GPUs in the machine. - * The `nvidia.com/gpu` capacity reports `0`. - * The `nvidia.com/gpu.shared` capacity equals the number of physical GPUs multiplied by the - specified number of GPU replicas to create. - - ```output - ... - Labels: - nvidia.com/gpu.count=4 - nvidia.com/gpu.product=Tesla-T4 - nvidia.com/gpu.replicas=4 - Capacity: - nvidia.com/gpu: 0 - nvidia.com/gpu.shared: 16 - ... - Allocatable: - nvidia.com/gpu: 0 - nvidia.com/gpu.shared: 16 - ... - ``` - -1. 
Optional: Deploy a workload to validate GPU time-slicing: - - * Create a file, such as `time-slicing-verification.yaml`, with contents like the following: - - ```yaml - apiVersion: apps/v1 - kind: Deployment - metadata: - name: time-slicing-verification - labels: - app: time-slicing-verification - spec: - replicas: 5 - selector: - matchLabels: - app: time-slicing-verification - template: - metadata: - labels: - app: time-slicing-verification - spec: - tolerations: - - key: nvidia.com/gpu - operator: Exists - effect: NoSchedule - hostPID: true - containers: - - name: cuda-sample-vector-add - image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04" - command: ["/bin/bash", "-c", "--"] - args: - - while true; do /cuda-samples/vectorAdd; done - resources: - limits: - nvidia.com/gpu: 1 - ``` - - * Create the deployment with multiple replicas: - - ```console - $ kubectl apply -f time-slicing-verification.yaml - ``` - - * Verify that all five replicas are running: - - ```console - $ kubectl get pods - ``` - - *Example Output* - - * View the logs from one of the pods: - - ```console - $ kubectl logs deploy/time-slicing-verification - ``` +## Activation - *Example Output* +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All config map manifests, the field-reference table, `kubectl`/Helm +command sequences, and expected node-label output live only in those reference +files — do not improvise commands from this dispatch layer. - * Stop the deployment: +## Phases - ```console - $ kubectl delete -f time-slicing-verification.yaml - ``` +| Phase | Summary | Reference | +|-------|---------|-----------| +| Concepts | What GPU time-slicing is, how it differs from MIG, supported platforms/resource types, limitations, and the node-label changes (`.replicas`, `-SHARED` suffix). 
| [references/concepts.md](references/concepts.md) | +| Configuration | The config map field reference plus all four ways to apply it: one cluster-wide config, multiple node-specific configs, configuring before install, and updating an existing config map. | [references/configuration.md](references/configuration.md) | +| Verification | Confirm the node advertises the additional GPU resources (for both `renameByDefault` modes) and deploy a multi-replica sample workload to validate sharing. | [references/verification.md](references/verification.md) | - *Example Output* +## Hard rules (apply across all phases) - ```output - deployment.apps "time-slicing-verification" deleted - ``` +- Time-slicing provides NO memory or fault isolation between replicas; use MIG when isolation is required. +- Requesting more than one time-sliced GPU does NOT grant proportional compute; set `failRequestsGreaterThanOne=true` to enforce awareness of this. +- The Operator does not monitor time-slicing config maps; after editing one, manually restart the device plugin daemonset to apply changes. +- The config map must live in the same namespace as the GPU Operator. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. -## References +## Verification -- [Blog post on GPU sharing in Kubernetes](https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes). -- [NVIDIA Kubernetes Device Plugin](https://github.com/NVIDIA/k8s-device-plugin) repository on GitHub. +After applying a config, confirm the node's `nvidia.com/gpu` (or +`nvidia.com/gpu.shared`) capacity reflects the configured replica count, then +optionally deploy the multi-replica sample workload. Exact commands and +expected label output are in [references/verification.md](references/verification.md). 
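The capacity check described above can be sketched as simple arithmetic. This is a minimal, hypothetical helper (the name `expected_capacity` is an assumption and is not part of the GPU Operator or `kubectl`), assuming that a node advertises the number of physical GPUs multiplied by the configured replica count:

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
#   $1 = number of physical GPUs (the value of the nvidia.com/gpu.count label)
#   $2 = replicas configured for the resource in the time-slicing config map
expected_capacity() {
  gpu_count="$1"
  replicas="$2"
  # Each physical GPU is advertised once per replica.
  echo $(( gpu_count * replicas ))
}

# Four physical GPUs with four replicas each.
expected_capacity 4 4
```

Compare the printed value with the `nvidia.com/gpu` (or `nvidia.com/gpu.shared`) capacity that the node reports. A mismatch may indicate that the device plugin has not yet picked up the config map.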
diff --git a/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/references/concepts.md b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/references/concepts.md new file mode 100644 index 000000000..02d2015a9 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/references/concepts.md @@ -0,0 +1,113 @@ + + + +# Understanding Time-Slicing GPUs + +The NVIDIA GPU Operator enables oversubscription of GPUs through a set +of extended options for the [NVIDIA Kubernetes Device Plugin](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/k8s-device-plugin). +GPU time-slicing enables workloads that are scheduled on oversubscribed GPUs to +interleave with one another. + +This mechanism for *time-slicing* +GPUs in Kubernetes lets a system administrator define a set of +*replicas* for a GPU, each of which can be handed out independently to a +pod to run workloads on. Unlike Multi-Instance GPU (MIG), there is no memory or +fault-isolation between replicas, but for some workloads this is better +than not being able to share at all. Internally, GPU +time-slicing is used to multiplex workloads from +replicas of the same underlying GPU. + +> [!NOTE] +> A typical resource request provides exclusive access to GPUs. +> A request for a time-sliced GPU provides shared access. +> A request for more than one time-sliced GPU does not guarantee that the pod +> receives access to a proportional amount of GPU compute power. + +A request for more than one time-sliced GPU only specifies that the pod +receives access to a GPU that is shared by other pods. +Each pod can run an unlimited number of processes on the underlying GPU. +The GPU simply provides an equal share of time to all GPU processes, across +all of the pods. +You can apply a cluster-wide default time-slicing configuration. +You can also apply node-specific configurations. 
+For example, you can apply a time-slicing configuration to nodes with Tesla-T4 GPUs only +and not modify nodes with other GPU models. + +You can combine the two approaches by applying a cluster-wide default configuration +and then label nodes so that those nodes receive a node-specific configuration. + +## Comparison: Time-Slicing and Multi-Instance GPU + +The latest generations of NVIDIA GPUs provide an operation mode called +Multi-Instance GPU (MIG). MIG allows you to partition a GPU +into several smaller, predefined instances, each of which looks like a +mini-GPU that provides memory and fault isolation at the hardware layer. +You can share access to a GPU by running workloads on one of +these predefined instances instead of the full native GPU. + +MIG support was added to Kubernetes in 2020. Refer to [Supporting MIG in Kubernetes](https://www.google.com/url?q=https://docs.google.com/document/d/1mdgMQ8g7WmaI_XVVRrCvHPFPOMCm5LQD5JefgAh6N8g/edit&sa=D&source=editors&ust=1655578433019961&usg=AOvVaw1F-OezvM-Svwr1lLsdQmu3) +for details on how this works. + +Time-slicing trades the memory and fault-isolation that is provided by MIG +for the ability to share a GPU by a larger number of users. +Time-slicing also provides a way to provide shared access to a GPU for +older generation GPUs that do not support MIG. +However, you can combine MIG and time-slicing to provide shared access to +MIG instances. + +## Support Platforms and Resource Types + +GPU time-slicing can be used with bare-metal applications, virtual machines +with GPU passthrough, and virtual machines with NVIDIA vGPU. + +Currently, the only supported resource types are `nvidia.com/gpu` +and any of the resource types that emerge from configuring a node with +the mixed MIG strategy. + +## Limitations + +- DCGM-Exporter does not support associating metrics to containers when GPU time-slicing is enabled with the NVIDIA Kubernetes Device Plugin. 
+- The Operator does not monitor changes to a time-slicing config map. + Refer to the **Updating a Time-Slicing Config Map** section. + +## Changes to Node Labels + +In addition to the standard node labels that GPU Feature Discovery (GFD) +applies to nodes, the following label is also applied after you configure +GPU time-slicing for a node: + +```yaml +nvidia.com/.replicas = +``` + +Where `` is the factor by which each resource of `` is oversubscribed. + +Additionally, by default, the `nvidia.com/.product` label is modified: + +```yaml +nvidia.com/.product = -SHARED +``` + +For example, on an NVIDIA DGX A100 machine, depending on the time-slicing configuration, +the labels can be similar to the following example: + +```yaml +nvidia.com/gpu.replicas = 8 +nvidia.com/gpu.product = A100-SXM4-40GB-SHARED +``` + +Using these labels, you can request time-sliced access to a GPU or exclusive access to a GPU +in the same way that you traditionally specify a node selector to request one GPU model over another. +That is, the `-SHARED` product name suffix ensures that you can specify a +node selector to assign pods to nodes with time-sliced GPUs. + +The `migStrategy` configuration option has an effect on the node label for the product name. +When `renameByDefault=false`, the default value, and `migStrategy=single`, both the MIG profile name +and the `-SHARED` suffix are appended to the product name, such as the following example: + +```yaml +nvidia.com/gpu.product = A100-SXM4-40GB-MIG-1g.5gb-SHARED +``` + +If you set `renameByDefault=true`, then the value of the `nvidia.com/gpu.product` node +label is not modified. 
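The label and capacity rules described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names `expected_capacity` and `expected_product_label` are hypothetical and are not part of `kubectl`, the device plugin, or the GPU Operator. The sample values (4 physical GPUs, 4 replicas, `Tesla-T4`) are borrowed from the verification example in this skill. The sketch ignores the `migStrategy=single` case, where the MIG profile name is also appended to the product label.

```shell
#!/bin/sh
# Hypothetical helpers for reasoning about time-slicing results.
# They do not query a cluster; they only restate the documented rules.

# expected_capacity PHYSICAL_GPUS REPLICAS
# Prints the number of GPU resources a node should advertise:
# physical GPUs multiplied by the configured replica count.
expected_capacity() {
  echo $(( $1 * $2 ))
}

# expected_product_label PRODUCT RENAME_BY_DEFAULT
# Prints the expected nvidia.com/gpu.product value.
# With renameByDefault=false (the default) the label gains a -SHARED suffix;
# with renameByDefault=true the label is left unmodified.
expected_product_label() {
  if [ "$2" = "true" ]; then
    echo "$1"
  else
    echo "$1-SHARED"
  fi
}

expected_capacity 4 4                    # prints 16
expected_product_label Tesla-T4 false    # prints Tesla-T4-SHARED
expected_product_label Tesla-T4 true     # prints Tesla-T4
```

Comparing these expected values with the `Labels` and `Capacity` sections of `kubectl describe node` output is one way to spot a config map that was not applied.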
diff --git a/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/references/configuration.md b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/references/configuration.md new file mode 100644 index 000000000..2d0587829 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/references/configuration.md @@ -0,0 +1,222 @@ + + + +# Configuring GPU Time-Slicing + +Throughout, replace `` with your target GPU Operator release. + +## About Configuring GPU Time-Slicing + +You configure GPU time-slicing by performing the following high-level steps: + +* Add a config map to the namespace that is used by the GPU operator. +* Configure the cluster policy so that the device plugin uses the config map. +* Apply a label to the nodes that you want to configure for GPU time-slicing. + +On a machine with one GPU, the following config map configures Kubernetes so that +the node advertises four GPU resources. +A machine with two GPUs advertises eight GPUs, and so on. + +## Sample Config Map + +The following table describes the key fields in the config map. + +| Field | Type | Description | +| --- | --- | --- | +| `data.` | string | Specifies the time-slicing configuration name. You can specify multiple configurations if you want to assign node-specific configurations. In the preceding example, the value for `key` is `any`. | +| `flags.migStrategy` | string | Specifies how to label MIG devices for the nodes that receive the time-slicing configuration. Specify one of `none`, `single`, or `mixed`. The default value is `none`. | +| `renameByDefault` | boolean | When set to `true`, each resource is advertised under the name `.shared` instead of ``. For example, if this field is set to `true` and the resource is typically `nvidia.com/gpu`, the nodes that are configured for time-sliced GPU access then advertise the resource as `nvidia.com/gpu.shared`. 
Setting this field to `true` can be helpful if you want to schedule pods on GPUs with shared access by specifying `.shared` in the resource request. When this field is set to `false`, the advertised resource name, such as `nvidia.com/gpu`, is not modified. However, the label for the product name is suffixed with `-SHARED`. For example, if the output of `kubectl describe node` shows the node label `nvidia.com/gpu.product=Tesla-T4`, then after the node is configured for time-sliced GPU access, the label becomes `nvidia.com/gpu.product=Tesla-T4-SHARED`. In this case, you can specify a node selector that includes the `-SHARED` suffix to schedule pods on GPUs with shared access. The default value is `false`. | +| `failRequestsGreaterThanOne` | boolean | The purpose of this field is to enforce awareness that requesting more than one GPU replica does not result in proportionally more access to the GPU. For example, if `4` GPU replicas are available and two pods request `1` GPU each and a third pod requests `2` GPUs, the applications in the three pods have an equal share of GPU compute time. Specifically, the pod that requests `2` GPUs does not receive twice as much compute time as the pods that request `1` GPU. When set to `true`, a resource request for more than one GPU fails with an `UnexpectedAdmissionError`. In this case, you must manually delete the pod, update the resource request, and redeploy. | +| `resources.name` | string | Specifies the resource type to make available with time-sliced access, such as `nvidia.com/gpu`, `nvidia.com/mig-1g.5gb`, and so on. | +| `resources.replicas` | integer | Specifies the number of time-sliced GPU replicas to make available for shared access to GPUs of the specified resource type. | + +## Applying One Cluster-Wide Configuration + +Perform the following steps to configure GPU time-slicing if you already installed the GPU operator +and want to apply the same time-slicing configuration on all nodes in the cluster. + +1.
Create a file, such as `time-slicing-config-all.yaml`, with contents like the following example: + + ```yaml + apiVersion: v1 + kind: ConfigMap + metadata: + name: time-slicing-config-all + data: + any: |- + version: v1 + flags: + migStrategy: none + sharing: + timeSlicing: + resources: + - name: nvidia.com/gpu + replicas: 4 + ``` + +1. Add the config map to the same namespace as the GPU operator: + + ```console + $ kubectl create -n gpu-operator -f time-slicing-config-all.yaml + ``` + +1. Configure the device plugin with the config map and set the default time-slicing configuration: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ + -n gpu-operator --type merge \ + -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-all", "default": "any"}}}}' + ``` + +1. Optional: Confirm that the `gpu-feature-discovery` and + `nvidia-device-plugin-daemonset` pods restart. + + ```console + $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' + ``` + + *Example Output* + +Refer to the verification reference (see [references/verification.md](verification.md)). + +## Applying Multiple Node-Specific Configurations + +An alternative to applying one cluster-wide configuration is to specify multiple +time-slicing configurations in the config map and to apply labels node-by-node to +control which configuration is applied to which nodes. + +1. 
Create a file, such as `time-slicing-config-fine.yaml`, with contents like the following example: + + ```yaml + apiVersion: v1 + kind: ConfigMap + metadata: + name: time-slicing-config-fine + data: + a100-40gb: |- + version: v1 + flags: + migStrategy: mixed + sharing: + timeSlicing: + resources: + - name: nvidia.com/gpu + replicas: 8 + - name: nvidia.com/mig-1g.5gb + replicas: 2 + - name: nvidia.com/mig-2g.10gb + replicas: 2 + - name: nvidia.com/mig-3g.20gb + replicas: 3 + - name: nvidia.com/mig-7g.40gb + replicas: 7 + tesla-t4: |- + version: v1 + flags: + migStrategy: none + sharing: + timeSlicing: + resources: + - name: nvidia.com/gpu + replicas: 4 + ``` + +1. Add the config map to the same namespace as the GPU operator: + + ```console + $ kubectl create -n gpu-operator -f time-slicing-config-fine.yaml + ``` + +1. Configure the device plugin with the config map and set the default time-slicing configuration: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ + -n gpu-operator --type merge \ + -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-fine"}}}}' + ``` + + Because the specification does not include the `devicePlugin.config.default` field, + when the device plugin pods redeploy, they do not automatically apply the time-slicing + configuration to all nodes. + +1. Optional: Confirm that the `gpu-feature-discovery` and + `nvidia-device-plugin-daemonset` pods restart. + + ```console + $ kubectl get events -n gpu-operator --sort-by='.lastTimestamp' + ``` + + *Example Output* + +1. 
Apply a label to the nodes by running one or more of the following commands: + + * Apply a label to nodes one-by-one by specifying the node name: + + ```console + $ kubectl label node nvidia.com/device-plugin.config=tesla-t4 + ``` + + * Apply a label to several nodes at one time by specifying a label selector: + + ```console + $ kubectl label node \ + --selector=nvidia.com/gpu.product=Tesla-T4 \ + nvidia.com/device-plugin.config=tesla-t4 + ``` + +Refer to the verification reference (see [references/verification.md](verification.md)). + +## Configuring Time-Slicing Before Installing the NVIDIA GPU Operator + +You can enable time-slicing with the NVIDIA GPU Operator by passing the +`devicePlugin.config.name=` parameter during installation. + +Perform the following steps to configure time-slicing before installing the operator: + +1. Create the namespace for the operator: + + ```console + $ kubectl create namespace gpu-operator + ``` + +1. Create a file, such as `time-slicing-config.yaml`, with the config map contents. + + Refer to the **Applying One Cluster-Wide Configuration** or **Applying Multiple Node-Specific Configurations** sections. + +1. Add the config map to the same namespace as the GPU operator: + + ```console + $ kubectl create -n gpu-operator -f time-slicing-config.yaml + ``` + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +1. Install the operator with Helm: + + ```console + $ helm install gpu-operator nvidia/gpu-operator \ + -n gpu-operator \ + --version= \ + --set devicePlugin.config.name=time-slicing-config + ``` + +1. Refer to either the **Applying One Cluster-Wide Configuration** or **Applying Multiple Node-Specific Configurations** section and perform the following tasks: + + * Configure the device plugin by running the `kubectl patch` command. + * Apply labels to nodes if you added a config map with node-specific configurations.
+ +After installation, refer to the verification reference (see [references/verification.md](verification.md)). + +## Updating a Time-Slicing Config Map + +The Operator does not monitor the time-slicing config maps. +As a result, if you modify a config map, the device plugin pods do not restart and do not apply the modified configuration. + +To apply the modified config map, manually restart the device plugin pods: + +```console +$ kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset +``` + +Currently running workloads are not affected and continue to run, though NVIDIA recommends performing the restart during a maintenance period. diff --git a/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/references/verification.md b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/references/verification.md new file mode 100644 index 000000000..8163ff2c9 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-timeslicing-gpus/references/verification.md @@ -0,0 +1,139 @@ + + + +# Verifying the GPU Time-Slicing Configuration + +Perform the following steps to verify that the time-slicing configuration is applied successfully: + +1. Confirm that the node advertises additional GPU resources: + + ```console + $ kubectl describe node + ``` + + *Example Output* + + The example output varies according to the GPU in your node and the configuration + that you apply. + + The following output applies when `renameByDefault` is set to `false`, + the default value. + The key considerations are as follows: + + * The `nvidia.com/gpu.count` label reports the number of physical GPUs in the machine. + * The `nvidia.com/gpu.product` label includes a `-SHARED` suffix to the product name. + * The `nvidia.com/gpu.replicas` label matches the reported capacity. + + ```output + ... + Labels: + nvidia.com/gpu.count=4 + nvidia.com/gpu.product=Tesla-T4-SHARED + nvidia.com/gpu.replicas=4 + Capacity: + nvidia.com/gpu: 16 + ... 
+ Allocatable: + nvidia.com/gpu: 16 + ... + ``` + + The following output applies when `renameByDefault` is set to `true`. + The key considerations are as follows: + + * The `nvidia.com/gpu.count` label reports the number of physical GPUs in the machine. + * The `nvidia.com/gpu` capacity reports `0`. + * The `nvidia.com/gpu.shared` capacity equals the number of physical GPUs multiplied by the + specified number of GPU replicas to create. + + ```output + ... + Labels: + nvidia.com/gpu.count=4 + nvidia.com/gpu.product=Tesla-T4 + nvidia.com/gpu.replicas=4 + Capacity: + nvidia.com/gpu: 0 + nvidia.com/gpu.shared: 16 + ... + Allocatable: + nvidia.com/gpu: 0 + nvidia.com/gpu.shared: 16 + ... + ``` + +1. Optional: Deploy a workload to validate GPU time-slicing: + + * Create a file, such as `time-slicing-verification.yaml`, with contents like the following: + + ```yaml + apiVersion: apps/v1 + kind: Deployment + metadata: + name: time-slicing-verification + labels: + app: time-slicing-verification + spec: + replicas: 5 + selector: + matchLabels: + app: time-slicing-verification + template: + metadata: + labels: + app: time-slicing-verification + spec: + tolerations: + - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + hostPID: true + containers: + - name: cuda-sample-vector-add + image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04" + command: ["/bin/bash", "-c", "--"] + args: + - while true; do /cuda-samples/vectorAdd; done + resources: + limits: + nvidia.com/gpu: 1 + ``` + + * Create the deployment with multiple replicas: + + ```console + $ kubectl apply -f time-slicing-verification.yaml + ``` + + * Verify that all five replicas are running: + + ```console + $ kubectl get pods + ``` + + *Example Output* + + * View the logs from one of the pods: + + ```console + $ kubectl logs deploy/time-slicing-verification + ``` + + *Example Output* + + * Stop the deployment: + + ```console + $ kubectl delete -f time-slicing-verification.yaml + ``` + + 
*Example Output* + + ```output + deployment.apps "time-slicing-verification" deleted + ``` + +## References + +- [Blog post on GPU sharing in Kubernetes](https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes). +- [NVIDIA Kubernetes Device Plugin](https://github.com/NVIDIA/k8s-device-plugin) repository on GitHub. From ad7d0f52f5bf7b36842587c4c5494ea4e440705a Mon Sep 17 00:00:00 2001 From: Andrew Chen Date: Mon, 1 Jun 2026 21:48:57 -0700 Subject: [PATCH 09/13] refactor(skills): info-hiding restructure batch 1 (5 small procedural skills) Restructure 5 procedural GPU Operator skills to the dispatch-layer information-hiding pattern (matching the prior 7-skill pass): thin SKILL.md (frontmatter + intro + Prerequisites + Activation + Phases table + cross-cutting hard rules + Verification, all <200 lines) with every step-by-step command sequence, manifest, and field detail moved into phase-specific references/*.md. Skills: custom-driver, install-service-mesh, uninstalling-nvidia, install-http-proxy, install-outdated-kernels. All five pass the close-your-eyes test (leaked cmds, bash blocks, and line count within thresholds). Verified-fix content (prerequisites, verification, placeholders, fixed cross-refs) preserved. 
Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../gpu-operator-custom-driver/SKILL.md | 114 ++++-------------- .../references/configure.md | 45 +++++++ .../references/example-nvidia-uvm.md | 59 +++++++++ .../gpu-operator-install-http-proxy/SKILL.md | 106 +++++----------- .../references/configure-and-deploy.md | 62 ++++++++++ .../references/openshift.md | 9 ++ .../SKILL.md | 111 ++++------------- .../references/workaround.md | 98 +++++++++++++++ .../SKILL.md | 61 ++++------ .../references/considerations.md | 20 +++ .../references/disable-injection.md | 30 +++++ .../gpu-operator-uninstalling-nvidia/SKILL.md | 91 ++++---------- .../references/crd-cleanup.md | 24 ++++ .../references/procedure.md | 59 +++++++++ 14 files changed, 530 insertions(+), 359 deletions(-) create mode 100644 gpu-operator/.agents/skills/gpu-operator-custom-driver/references/configure.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-custom-driver/references/example-nvidia-uvm.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-http-proxy/references/configure-and-deploy.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-http-proxy/references/openshift.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/references/workaround.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-service-mesh/references/considerations.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-service-mesh/references/disable-injection.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/references/crd-cleanup.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/references/procedure.md diff --git a/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md index c345bc6d3..e9b55a9f6 100644 --- 
a/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-custom-driver/SKILL.md @@ -20,11 +20,11 @@ tags: # Customizing NVIDIA GPU Driver Parameters during Installation -The NVIDIA Driver kernel modules accept a number of parameters which can be used to customize the behavior of the driver. -By default, the GPU Operator loads the kernel modules with default values. -On a machine with the driver already installed, you can list the parameter names and values with the `cat /proc/driver/nvidia/params` command. -You can pass custom parameters to the kernel modules that get loaded as part of the -NVIDIA Driver installation (`nvidia`, `nvidia-modeset`, `nvidia-uvm`, and `nvidia-peermem`). +The NVIDIA Driver kernel modules accept a number of parameters that customize +driver behavior. By default, the GPU Operator loads the kernel modules +(`nvidia`, `nvidia-modeset`, `nvidia-uvm`, and `nvidia-peermem`) with default +values. This skill shows how to supply custom kernel-module parameters through +a `ConfigMap` referenced at install time. ## Prerequisites @@ -32,94 +32,30 @@ NVIDIA Driver installation (`nvidia`, `nvidia-modeset`, `nvidia-uvm`, and `nvidi - The NVIDIA GPU Operator installed (use the `gpu-operator-install` skill). - The GPU Operator deploys the NVIDIA driver as a container (`driver.enabled=true`, the default). Custom kernel-module parameters do not apply when you use pre-installed host drivers. -## Configure Custom Driver Parameters +## Activation -To pass custom parameters, execute the following steps. +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All command sequences, manifest contents, and verification output +live only in those reference files — do not improvise commands from this +dispatch layer. -1. 
Create a configuration file named `.conf`, where `` is the name of the kernel module the parameters are for. - The file should contain parameters as key-value pairs -- one parameter per line. +## Phases - The following example shows the GPU firmware logging parameter being passed to the `nvidia` module. +| Phase | Summary | Reference | +|-------|---------|-----------| +| Configure | Create a `.conf` parameter file, wrap it in a `ConfigMap`, and install the GPU Operator with `driver.kernelModuleConfig.name` pointing at it. | [references/configure.md](references/configure.md) | +| Example (`nvidia-uvm`) | A worked example that disables Heterogeneous Memory Management (HMM) via `uvm_disable_hmm`, plus how to verify the parameter on the node. | [references/example-nvidia-uvm.md](references/example-nvidia-uvm.md) | - ```console - $ cat nvidia.conf - NVreg_EnableGpuFirmwareLogs=2 - ``` +## Hard rules (apply across all phases) -1. Create a `ConfigMap` for the configuration file. - If multiple modules are being configured, pass multiple files when creating the `ConfigMap`. +- The `.conf` filename must match the kernel module the parameters apply to (`nvidia`, `nvidia-modeset`, `nvidia-uvm`, or `nvidia-peermem`). +- Parameters are key-value pairs, one per line. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. - ```console - $ kubectl create configmap kernel-module-params -n gpu-operator --from-file=nvidia.conf=./nvidia.conf - ``` +## Verification -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). - -1. Install the GPU Operator and set `driver.kernelModuleConfig.name` to the name of the `ConfigMap` - containing the kernel module parameters. 
- - ```console - $ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set driver.kernelModuleConfig.name="kernel-module-params" - ``` - -### Example using `nvidia-uvm` module - -This example shows the Heterogeneous Memory Management (HMM) being disabled in the `nvidia-uvm` module. -Refer to [Simplifying GPU Application Development with Heterogeneous Memory Management](https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/) for more information about HMM. - -1. Create a configuration file named `nvidia-uvm.conf`: - - ```console - $ cat nvidia-uvm.conf - uvm_disable_hmm=1 - ``` - -1. Create a `ConfigMap` for the configuration file. - If multiple modules are being configured, pass multiple files when creating the `ConfigMap`. - - ```console - $ kubectl create configmap kernel-module-params -n gpu-operator --from-file=nvidia-uvm.conf=./nvidia-uvm.conf - ``` - -1. Install the GPU Operator and set `driver.kernelModuleConfig.name` to the name of the `ConfigMap` - containing the kernel module parameters. - - ```console - $ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set driver.kernelModuleConfig.name="kernel-module-params" - ``` - -1. Verify the parameter has been correctly applied, go to `/sys/module/nvidia_uvm/parameters/` on the node: - - ```console - $ ls /sys/module/nvidia_uvm/parameters/ - ``` - - *Example Output* - - ```output - ... - uvm_disable_hmm uvm_perf_access_counter_migration_enable uvm_perf_prefetch_min_faults - uvm_downgrade_force_membar_sys uvm_perf_access_counter_threshold uvm_perf_prefetch_threshold - ... - ``` - - Then check the value of the parameter: - - ```console - $ cat /sys/module/nvidia_uvm/parameters/uvm_disable_hmm - ``` - - *Example Output* - - ```output - Y - ``` +Inspect the applied parameter on a GPU node under +`/sys/module//parameters/`. 
The worked example in +[references/example-nvidia-uvm.md](references/example-nvidia-uvm.md) shows the +exact commands and expected output. diff --git a/gpu-operator/.agents/skills/gpu-operator-custom-driver/references/configure.md b/gpu-operator/.agents/skills/gpu-operator-custom-driver/references/configure.md new file mode 100644 index 000000000..9bceb76ce --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-custom-driver/references/configure.md @@ -0,0 +1,45 @@ + + + +# Configure Custom Driver Parameters + +The NVIDIA Driver kernel modules accept a number of parameters that customize +driver behavior. By default, the GPU Operator loads the kernel modules with +default values. On a machine with the driver already installed, you can list +the parameter names and values with the `cat /proc/driver/nvidia/params` +command. You can pass custom parameters to the kernel modules that get loaded +as part of the NVIDIA Driver installation (`nvidia`, `nvidia-modeset`, +`nvidia-uvm`, and `nvidia-peermem`). + +To pass custom parameters, execute the following steps. + +1. Create a configuration file named `.conf`, where `` is the name of the kernel module the parameters are for. + The file should contain parameters as key-value pairs -- one parameter per line. + + The following example shows the GPU firmware logging parameter being passed to the `nvidia` module. + + ```console + $ cat nvidia.conf + NVreg_EnableGpuFirmwareLogs=2 + ``` + +1. Create a `ConfigMap` for the configuration file. + If multiple modules are being configured, pass multiple files when creating the `ConfigMap`. + + ```console + $ kubectl create configmap kernel-module-params -n gpu-operator --from-file=nvidia.conf=./nvidia.conf + ``` + + > [!NOTE] + > Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +1. 
Install the GPU Operator and set `driver.kernelModuleConfig.name` to the name of the `ConfigMap` + containing the kernel module parameters. + + ```console + $ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set driver.kernelModuleConfig.name="kernel-module-params" + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-custom-driver/references/example-nvidia-uvm.md b/gpu-operator/.agents/skills/gpu-operator-custom-driver/references/example-nvidia-uvm.md new file mode 100644 index 000000000..d23588bc8 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-custom-driver/references/example-nvidia-uvm.md @@ -0,0 +1,59 @@ + + + +# Example: `nvidia-uvm` module + +This example shows the Heterogeneous Memory Management (HMM) being disabled in the `nvidia-uvm` module. +Refer to [Simplifying GPU Application Development with Heterogeneous Memory Management](https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/) for more information about HMM. + +1. Create a configuration file named `nvidia-uvm.conf`: + + ```console + $ cat nvidia-uvm.conf + uvm_disable_hmm=1 + ``` + +1. Create a `ConfigMap` for the configuration file. + If multiple modules are being configured, pass multiple files when creating the `ConfigMap`. + + ```console + $ kubectl create configmap kernel-module-params -n gpu-operator --from-file=nvidia-uvm.conf=./nvidia-uvm.conf + ``` + +1. Install the GPU Operator and set `driver.kernelModuleConfig.name` to the name of the `ConfigMap` + containing the kernel module parameters. + + ```console + $ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set driver.kernelModuleConfig.name="kernel-module-params" + ``` + +1. 
To verify that the parameter has been correctly applied, list `/sys/module/nvidia_uvm/parameters/` on the node: + + ```console + $ ls /sys/module/nvidia_uvm/parameters/ + ``` + + *Example Output* + + ```output + ... + uvm_disable_hmm uvm_perf_access_counter_migration_enable uvm_perf_prefetch_min_faults + uvm_downgrade_force_membar_sys uvm_perf_access_counter_threshold uvm_perf_prefetch_threshold + ... + ``` + + Then check the value of the parameter: + + ```console + $ cat /sys/module/nvidia_uvm/parameters/uvm_disable_hmm + ``` + + *Example Output* + + ```output + Y + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md index 862bbe1d8..de29c0637 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/SKILL.md @@ -20,93 +20,43 @@ tags: # Install GPU Operator in Proxy Environments +Deploy the GPU Operator in clusters behind an HTTP proxy. By default, the +Operator needs internet access to pull container images and to let the +`driver` container download OS packages; this skill configures the `driver` +container to route that traffic through the proxy. Configuring Kubernetes / +container-runtime components for the proxy is out of scope (not GPU-Operator-specific). + +> [!TIP] +> Using precompiled drivers removes the need for the `driver` container to download OS packages (use the `gpu-operator-precompiled-drivers` skill). + ## Prerequisites - A Kubernetes cluster configured with HTTP proxy settings, where the container runtime is enabled with the HTTP proxy. - The `kubectl` and `helm` CLIs available on a client machine. -## Introduction - -This page describes how to successfully deploy the GPU Operator in clusters behind an HTTP proxy. -By default, the GPU Operator requires internet access for the following reasons: - -1.
Container images need to be pulled during GPU Operator installation. -1. The `driver` container needs to download several OS packages prior to driver installation. - - > [!TIP] - > Using precompiled drivers removes the need for the `driver` containers to download operating system packages (use the `gpu-operator-precompiled-drivers` skill). - -To address these requirements, all Kubernetes nodes as well as the `driver` container need proper configuration in order to direct traffic through the proxy. - -This document demonstrates how to configure the GPU Operator so that the `driver` container can successfully -download packages behind a HTTP proxy. Since configuring Kubernetes/container runtime components to use -a proxy is not specific to the GPU Operator, we do not include those instructions here. - -The instructions for Openshift are different, so skip the **HTTP Proxy Configuration for Openshift** section if you are not running Openshift. - -## HTTP Proxy Configuration for Openshift - -For Openshift, it is recommended to use the cluster-wide Proxy object to provide proxy information for the cluster. -Follow the procedure described in [Configuring the cluster-wide proxy](https://docs.openshift.com/container-platform/latest/networking/enable-cluster-wide-proxy.html) -from Red Hat Openshift public documentation. The GPU Operator will automatically inject proxy related ENV into the `driver` container -based on information present in the cluster-wide Proxy object. - -## HTTP Proxy Configuration - -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). 
- -First, get the `values.yaml` file used for GPU Operator configuration: - -```console -$ curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator//deployments/gpu-operator/values.yaml -``` - -Specify `driver.env` in `values.yaml` with appropriate HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables -(in both uppercase and lowercase). - -```yaml -driver: - env: - - name: HTTPS_PROXY - value: http:// - - name: HTTP_PROXY - value: http:// - - name: NO_PROXY - value: - - name: https_proxy - value: http:// - - name: http_proxy - value: http:// - - name: no_proxy - value: -``` - -> [!NOTE] -> * Proxy related ENV are automatically injected by GPU Operator into the `driver` container to indicate proxy information used when downloading necessary packages. -> * If HTTPS Proxy server is setup then change the values of HTTPS_PROXY and https_proxy to use `https` instead. - -## Deploy GPU Operator +## Activation -Download and deploy GPU Operator Helm Chart with the updated `values.yaml`. +Do this first: choose the phase matching your platform from the Phases table +below, then **read the corresponding `references/.md` file before +acting**. All command sequences, `values.yaml` content, and manifest details +live only in those reference files — do not improvise commands from this +dispatch layer. -Fetch the chart from the NGC repository: +## Phases -```console -$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-.tgz -``` +| Phase | Summary | Reference | +|-------|---------|-----------| +| Openshift | Use the cluster-wide Proxy object; the Operator auto-injects proxy ENV into the `driver` container. Skip the non-Openshift phase. | [references/openshift.md](references/openshift.md) | +| Configure and deploy (non-Openshift) | Fetch `values.yaml`, set `driver.env` proxy variables (upper- and lowercase HTTP_PROXY/HTTPS_PROXY/NO_PROXY), fetch the chart, and install with the updated values. 
| [references/configure-and-deploy.md](references/configure-and-deploy.md) | -Install the GPU Operator with updated `values.yaml`: +## Hard rules (apply across all phases) -```console -$ helm install --wait gpu-operator \ - -n gpu-operator --create-namespace \ - gpu-operator-.tgz \ - -f values.yaml -``` +- On Openshift, do not hand-edit `driver.env`; configure the cluster-wide Proxy object and let the Operator inject the values. +- Set proxy variables in both uppercase and lowercase forms. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. -Check the status of the pods to ensure all the containers are running: +## Verification -```console -$ kubectl get pods -n gpu-operator -``` +After install, run `kubectl get pods -n gpu-operator` and confirm all +containers reach `Running`/`Completed`. Exact commands are in +[references/configure-and-deploy.md](references/configure-and-deploy.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/references/configure-and-deploy.md b/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/references/configure-and-deploy.md new file mode 100644 index 000000000..21ed2d10c --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/references/configure-and-deploy.md @@ -0,0 +1,62 @@ + + + +# HTTP Proxy Configuration (non-Openshift) and Deploy + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +First, get the `values.yaml` file used for GPU Operator configuration: + +```console +$ curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator//deployments/gpu-operator/values.yaml +``` + +Specify `driver.env` in `values.yaml` with appropriate HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables +(in both uppercase and lowercase). 
+ +```yaml +driver: + env: + - name: HTTPS_PROXY + value: http:// + - name: HTTP_PROXY + value: http:// + - name: NO_PROXY + value: + - name: https_proxy + value: http:// + - name: http_proxy + value: http:// + - name: no_proxy + value: +``` + +> [!NOTE] +> * Proxy related ENV are automatically injected by GPU Operator into the `driver` container to indicate proxy information used when downloading necessary packages. +> * If HTTPS Proxy server is setup then change the values of HTTPS_PROXY and https_proxy to use `https` instead. + +## Deploy GPU Operator + +Download and deploy GPU Operator Helm Chart with the updated `values.yaml`. + +Fetch the chart from the NGC repository: + +```console +$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-.tgz +``` + +Install the GPU Operator with updated `values.yaml`: + +```console +$ helm install --wait gpu-operator \ + -n gpu-operator --create-namespace \ + gpu-operator-.tgz \ + -f values.yaml +``` + +Check the status of the pods to ensure all the containers are running: + +```console +$ kubectl get pods -n gpu-operator +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/references/openshift.md b/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/references/openshift.md new file mode 100644 index 000000000..8db3421e3 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-http-proxy/references/openshift.md @@ -0,0 +1,9 @@ + + + +# HTTP Proxy Configuration for Openshift + +For Openshift, it is recommended to use the cluster-wide Proxy object to provide proxy information for the cluster. +Follow the procedure described in [Configuring the cluster-wide proxy](https://docs.openshift.com/container-platform/latest/networking/enable-cluster-wide-proxy.html) +from Red Hat Openshift public documentation. The GPU Operator will automatically inject proxy related ENV into the `driver` container +based on information present in the cluster-wide Proxy object. 
diff --git a/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md index 76d96ac2a..09386cd26 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/SKILL.md @@ -21,102 +21,41 @@ tags: # Considerations when Installing with Outdated Kernels in Cluster +When a GPU node runs a kernel that is not the latest available, the `driver` +container can fail to find matching kernel packages (kernel-headers, +kernel-devel) and logs `Could not resolve Linux kernel version`. Upgrading to +the latest kernel is the preferred fix; when that is not an option, this skill +provides a workaround that mounts an archived package repository into the +`driver` container. + ## Prerequisites - A running Kubernetes cluster with NVIDIA GPU worker nodes. - The `kubectl` and `helm` CLIs available on a client machine. - One or more GPU nodes whose running kernel is not the latest available kernel, where the `driver` container reports `Could not resolve Linux kernel version`. -## About This Workaround - -The `driver` container deployed as part of the GPU Operator requires certain packages to be available as part of the driver installation. -On GPU nodes where the running kernel is not the latest, the `driver` container may fail to find the right version of these packages -(e.g. kernel-headers, kernel-devel) that correspond to the running kernel version. In the `driver` container logs, you will most likely -see the following error message: `Could not resolve Linux kernel version`. - -In general, upgrading your system to the latest kernel should fix this issue. But if this is not an option, the following is a -workaround to successfully deploy the GPU Operator when GPU nodes in your cluster may not be running the latest kernel. 
- -## Add Archived Package Repositories - -The workaround is to find the package archive containing packages for your outdated kernel and to add this repository to the package -manager running inside the `driver` container. To achieve this, we can simply mount a repository list file into the `driver` container using a `ConfigMap`. -The `ConfigMap` containing the repository list file needs to be created in the `gpu-operator` namespace. - -Let us demonstrate this workaround via an example. The system used in this example is running CentOS 7 with an outdated kernel: - -```console -$ uname -r -3.10.0-1062.12.1.el7.x86_64 -``` - -The official archive for older CentOS packages is https://vault.centos.org/. Typically, most archived CentOS repositories -are found in `/etc/yum.repos.d/CentOS-Vault.repo` but they are disabled by default. If the appropriate archive repository -was enabled, then the `driver` container would resolve the kernel version and be able to install the correct versions -of the prerequisite packages. - -We can simply drop in a replacement of `/etc/yum.repos.d/CentOS-Vault.repo` to ensure the appropriate CentOS archive is enabled. -For the kernel running in this example, the `CentOS-7.7.1908` archive contains the kernel-headers version we are looking for. 
-Here is our example drop-in replacement file: - -```text -[C7.7.1908-base] -name=CentOS-7.7.1908 - Base -baseurl=http://vault.centos.org/7.7.1908/os/$basearch/ -gpgcheck=1 -gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7 -enabled=1 - -[C7.7.1908-updates] -name=CentOS-7.7.1908 - Updates -baseurl=http://vault.centos.org/7.7.1908/updates/$basearch/ -gpgcheck=1 -gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7 -enabled=1 -``` - -Once the repo list file is created, we can create a `ConfigMap` for it: - -```console -$ kubectl create configmap repo-config -n gpu-operator --from-file= -``` - -Once the `ConfigMap` is created using the above command, update `values.yaml` with this information, to let the GPU Operator mount the repo configuration -within the `driver` container to pull required packages. - -For Ubuntu: - -```yaml -driver: - repoConfig: - configMapName: repo-config - destinationDir: /etc/apt/sources.list.d -``` +## Activation -For RHEL/Centos/RHCOS: +Do this first: read the workaround reference below **before acting**. All +command sequences, the repo-list file contents, the `values.yaml` snippets, +and the install command live only in the reference file — do not improvise +commands from this dispatch layer. -```yaml -driver: - repoConfig: - configMapName: repo-config - destinationDir: /etc/yum.repos.d -``` +## Steps -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). +| Step | Summary | Reference | +|------|---------|-----------| +| Workaround | Identify the running kernel, find the matching archived package repo, create a repo-list `ConfigMap` in `gpu-operator`, set `driver.repoConfig` in `values.yaml` (Ubuntu vs RHEL/CentOS/RHCOS paths), and install. 
| [references/workaround.md](references/workaround.md) | -Deploy GPU Operator with updated `values.yaml`: +## Hard rules (apply across all phases) -```console -$ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - -f values.yaml -``` +- Prefer upgrading the node kernel; the archived-repo workaround is for when upgrading is not an option. +- Create the repo-config `ConfigMap` in the `gpu-operator` namespace. +- Use the `destinationDir` matching the node OS family (`/etc/apt/sources.list.d` for Ubuntu, `/etc/yum.repos.d` for RHEL/CentOS/RHCOS). +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. -Check the status of the pods to ensure all the containers are running: +## Verification -```console -$ kubectl get pods -n gpu-operator -``` +After deploying, run `kubectl get pods -n gpu-operator` and confirm all +containers are `Running`. Exact commands are in +[references/workaround.md](references/workaround.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/references/workaround.md b/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/references/workaround.md new file mode 100644 index 000000000..a2e52a69a --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-outdated-kernels/references/workaround.md @@ -0,0 +1,98 @@ + + + +# Add Archived Package Repositories (Workaround) + +## About this workaround + +The `driver` container deployed as part of the GPU Operator requires certain packages to be available as part of the driver installation. +On GPU nodes where the running kernel is not the latest, the `driver` container may fail to find the right version of these packages +(e.g. kernel-headers, kernel-devel) that correspond to the running kernel version. 
In the `driver` container logs, you will most likely +see the following error message: `Could not resolve Linux kernel version`. + +In general, upgrading your system to the latest kernel should fix this issue. But if this is not an option, the following is a +workaround to successfully deploy the GPU Operator when GPU nodes in your cluster may not be running the latest kernel. + +## Procedure + +The workaround is to find the package archive containing packages for your outdated kernel and to add this repository to the package +manager running inside the `driver` container. To achieve this, we can simply mount a repository list file into the `driver` container using a `ConfigMap`. +The `ConfigMap` containing the repository list file needs to be created in the `gpu-operator` namespace. + +Let us demonstrate this workaround via an example. The system used in this example is running CentOS 7 with an outdated kernel: + +```console +$ uname -r +3.10.0-1062.12.1.el7.x86_64 +``` + +The official archive for older CentOS packages is https://vault.centos.org/. Typically, most archived CentOS repositories +are found in `/etc/yum.repos.d/CentOS-Vault.repo` but they are disabled by default. If the appropriate archive repository +was enabled, then the `driver` container would resolve the kernel version and be able to install the correct versions +of the prerequisite packages. + +We can simply drop in a replacement of `/etc/yum.repos.d/CentOS-Vault.repo` to ensure the appropriate CentOS archive is enabled. +For the kernel running in this example, the `CentOS-7.7.1908` archive contains the kernel-headers version we are looking for. 
+Here is our example drop-in replacement file: + +```text +[C7.7.1908-base] +name=CentOS-7.7.1908 - Base +baseurl=http://vault.centos.org/7.7.1908/os/$basearch/ +gpgcheck=1 +gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7 +enabled=1 + +[C7.7.1908-updates] +name=CentOS-7.7.1908 - Updates +baseurl=http://vault.centos.org/7.7.1908/updates/$basearch/ +gpgcheck=1 +gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7 +enabled=1 +``` + +Once the repo list file is created, we can create a `ConfigMap` for it: + +```console +$ kubectl create configmap repo-config -n gpu-operator --from-file= +``` + +Once the `ConfigMap` is created using the above command, update `values.yaml` with this information, to let the GPU Operator mount the repo configuration +within the `driver` container to pull required packages. + +For Ubuntu: + +```yaml +driver: + repoConfig: + configMapName: repo-config + destinationDir: /etc/apt/sources.list.d +``` + +For RHEL/Centos/RHCOS: + +```yaml +driver: + repoConfig: + configMapName: repo-config + destinationDir: /etc/yum.repos.d +``` + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). 
+ +Deploy GPU Operator with updated `values.yaml`: + +```console +$ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + -f values.yaml +``` + +Check the status of the pods to ensure all the containers are running: + +```console +$ kubectl get pods -n gpu-operator +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/SKILL.md index 9b09024a1..eccf644bc 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/SKILL.md @@ -20,54 +20,41 @@ tags: # Install GPU Operator with Service Mesh +Run the NVIDIA GPU Operator in a cluster that uses an Istio CNI or Linkerd CNI +service mesh. The core requirement is that the driver's `k8s-driver-manager` +init container can reach the Kubernetes API server, which conflicts with +default sidecar injection. + ## Prerequisites - A running Kubernetes cluster with NVIDIA GPU worker nodes. - A service mesh based on Istio CNI or Linkerd CNI installed in the cluster. - The `kubectl` and `helm` CLIs available on a client machine. -## Special Considerations for Service Meshes - -You can use NVIDIA GPU Operator in a cluster that uses a service mesh provided by Istio CNI or Linkerd CNI. - -The typical consideration for using the Operator with a service mesh is that the `k8s-driver-manager` init container -for the `driver` container needs network access to the Kubernetes API server of the cluster. - -The data plane---implemented by Istio CNI or Linkerd CNI as proxies running as sidecar containers---must be running for any pod networking to work. -The proxy sidecar containers start only after the init phase of the pod, so init containers are not able to communicate with the API server. 
- -To address the connectivity challenge, NVIDIA recommends disabling injection for the GPU Operator namespace. -Refer to the following documentation for more information: +## Activation -- [Controlling the injection policy](https://istio.io/latest/docs/setup/additional-setup/sidecar-injection/#controlling-the-injection-policy) - in the Istio documentation. -- [Overriding injection](https://linkerd.io/2.14/features/proxy-injection/#overriding-injection) - in the Linkerd documentation. +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All command sequences and verification output live only in those +reference files — do not improvise commands from this dispatch layer. -## Label the Namespace to Disable Injection +## Phases -- Label the Operator namespace to prevent automatic injection: +| Phase | Summary | Reference | +|-------|---------|-----------| +| Considerations | Why service-mesh sidecars break the driver init container's API-server access, and why NVIDIA recommends disabling injection for the Operator namespace. | [references/considerations.md](references/considerations.md) | +| Disable injection | Label the `gpu-operator` namespace to disable Istio/Linkerd sidecar injection, then install (via the `gpu-operator-install` skill) and verify pods start. | [references/disable-injection.md](references/disable-injection.md) | - ```console - $ kubectl label namespace gpu-operator istio-injection=disabled - ``` +## Hard rules (apply across all phases) - Or, for Linkerd: - - ```console - $ kubectl label namespace gpu-operator linkerd.io/inject=disabled - ``` - -If the GPU Operator is not already installed, use the `gpu-operator-install` skill for information about custom options and common installation scenarios. +- Disable sidecar injection for the `gpu-operator` namespace specifically; do not disable it cluster-wide. 
+- Use the injection-disable label matching your mesh (`istio-injection=disabled` for Istio, `linkerd.io/inject=disabled` for Linkerd). +- For Operator install options and scenarios, defer to the `gpu-operator-install` skill rather than duplicating install steps here. ## Verification -After labeling the namespace and installing the Operator, confirm that the GPU Operator pods start successfully despite the service mesh: - -1. Confirm the Operator pods are running: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - All operands, including the `nvidia-driver-daemonset` and `nvidia-operator-validator` pods, should report `Running` or `Completed`. If the `k8s-driver-manager` init container is stuck, confirm that sidecar injection is disabled for the `gpu-operator` namespace. +After labeling and installing, confirm all Operator operands (including +`nvidia-driver-daemonset` and `nvidia-operator-validator`) report `Running` or +`Completed`. A stuck `k8s-driver-manager` init container indicates injection is +still enabled for the namespace. Exact commands are in +[references/disable-injection.md](references/disable-injection.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/references/considerations.md b/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/references/considerations.md new file mode 100644 index 000000000..de3270437 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/references/considerations.md @@ -0,0 +1,20 @@ + + + +# Special Considerations for Service Meshes + +You can use NVIDIA GPU Operator in a cluster that uses a service mesh provided by Istio CNI or Linkerd CNI. + +The typical consideration for using the Operator with a service mesh is that the `k8s-driver-manager` init container +for the `driver` container needs network access to the Kubernetes API server of the cluster. 
+ +The data plane---implemented by Istio CNI or Linkerd CNI as proxies running as sidecar containers---must be running for any pod networking to work. +The proxy sidecar containers start only after the init phase of the pod, so init containers are not able to communicate with the API server. + +To address the connectivity challenge, NVIDIA recommends disabling injection for the GPU Operator namespace. +Refer to the following documentation for more information: + +- [Controlling the injection policy](https://istio.io/latest/docs/setup/additional-setup/sidecar-injection/#controlling-the-injection-policy) + in the Istio documentation. +- [Overriding injection](https://linkerd.io/2.14/features/proxy-injection/#overriding-injection) + in the Linkerd documentation. diff --git a/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/references/disable-injection.md b/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/references/disable-injection.md new file mode 100644 index 000000000..e86910b2e --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-service-mesh/references/disable-injection.md @@ -0,0 +1,30 @@ + + + +# Label the Namespace to Disable Injection + +- Label the Operator namespace to prevent automatic injection: + + ```console + $ kubectl label namespace gpu-operator istio-injection=disabled + ``` + + Or, for Linkerd: + + ```console + $ kubectl label namespace gpu-operator linkerd.io/inject=disabled + ``` + +If the GPU Operator is not already installed, use the `gpu-operator-install` skill for information about custom options and common installation scenarios. + +## Verification + +After labeling the namespace and installing the Operator, confirm that the GPU Operator pods start successfully despite the service mesh: + +1. 
Confirm the Operator pods are running: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + All operands, including the `nvidia-driver-daemonset` and `nvidia-operator-validator` pods, should report `Running` or `Completed`. If the `k8s-driver-manager` init container is stuck, confirm that sidecar injection is disabled for the `gpu-operator` namespace. diff --git a/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/SKILL.md index 0fe63edb8..bf9bc9ae5 100644 --- a/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/SKILL.md @@ -20,84 +20,37 @@ tags: # Uninstalling the GPU Operator +Remove the NVIDIA GPU Operator from a Kubernetes cluster and clean up the +resources it leaves behind, including driver custom resources, CRDs, and loaded +kernel modules. + ## Prerequisites - A Kubernetes cluster with the NVIDIA GPU Operator installed. - The `kubectl` and `helm` CLIs available on a client machine, with access to the cluster and the namespace where the Operator is installed (typically `gpu-operator`). -## Procedure - -Perform the following steps to uninstall the Operator. - -1. Optional: List and delete NVIDIA driver custom resources. - - ```console - $ kubectl get nvidiadrivers - ``` - - *Example Output* - - ```output - NAME STATUS AGE - demo-gold ready 2023-10-16T17:57:12Z - demo-silver ready 2023-10-16T17:57:12Z - ``` - - ```console - $ kubectl delete nvidiadriver demo-gold - $ kubectl delete nvidiadriver demo-silver - ``` - - ```console - $ kubectl delete crd nvidiadrivers.nvidia.com - ``` - -1. Delete the Operator: - - ```console - $ helm delete -n gpu-operator $(helm list -n gpu-operator | grep gpu-operator | awk '{print $1}') - ``` - -1. 
Optional: List the pods in the Operator namespace to confirm the pods are deleted or in the process of deleting: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - *Example Output* +## Activation - ```output - No resources found. - ``` +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All command sequences and expected output live only in those +reference files — do not improvise commands from this dispatch layer. -By default, Helm does not [support deleting existing CRDs](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations) -when you delete the chart. -As a result, the `clusterpolicy` CRD and `nvidiadrivers` CRD will still remain, by default. +## Phases -```console -$ kubectl get crd clusterpolicies.nvidia.com -``` +| Phase | Summary | Reference | +|-------|---------|-----------| +| Procedure | Optionally delete `nvidiadriver` custom resources, delete the Operator Helm release, confirm pods are gone, and (note) unload lingering driver kernel modules / handle Helm-hook image-pull failures. | [references/procedure.md](references/procedure.md) | +| CRD cleanup | Why the `clusterpolicies` and `nvidiadrivers` CRDs survive a chart delete by default, and the two ways to remove them (`operator.cleanupCRD=true` post-delete hook, or manual `kubectl delete crd`). | [references/crd-cleanup.md](references/crd-cleanup.md) | -To overcome this, the Operator uses a [post-delete hook](https://helm.sh/docs/topics/charts_hooks/#the-available-hooks) -to perform the CRD cleanup. -The `operator.cleanupCRD` chart parameter is added to enable this hook. -This parameter is disabled by default. -You can enable the hook by specifying `--set operator.cleanupCRD=true` during install or upgrade to perform automatic CRD cleanup on chart deletion. 
+## Hard rules (apply across all phases) -Alternatively, you can delete the custom resource definition: +- Helm does not delete CRDs on chart removal by default; the `clusterpolicies.nvidia.com` CRD persists unless explicitly cleaned up. +- Helm hooks run the Operator image itself; if the image cannot be pulled, delete with `--no-hooks` to avoid hanging. +- Driver kernel modules can remain loaded after uninstall; reboot or `rmmod` to fully remove them. -```console -$ kubectl delete crd clusterpolicies.nvidia.com -``` +## Verification -> [!NOTE] -> - After uninstalling the Operator, the NVIDIA driver modules might still be loaded. -> Either reboot the node or unload them using the following command: -> -> ```console -> $ sudo rmmod nvidia_modeset nvidia_uvm nvidia -> ``` -> -> - Helm hooks used with the GPU Operator use the Operator image itself. -> If the Operator image cannot be pulled successfully (either due to network error or an invalid NGC registry secret in case of NVAIE), hooks will fail. -> In this case, delete the chart and specify the `--no-hooks` argument to avoid hanging on hook failures. +After deleting the Operator release, confirm `kubectl get pods -n gpu-operator` +reports `No resources found.` Exact commands and expected output are in +[references/procedure.md](references/procedure.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/references/crd-cleanup.md b/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/references/crd-cleanup.md new file mode 100644 index 000000000..87c054d69 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/references/crd-cleanup.md @@ -0,0 +1,24 @@ + + + +# CRD Cleanup + +By default, Helm does not [support deleting existing CRDs](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations) +when you delete the chart. +As a result, the `clusterpolicy` CRD and `nvidiadrivers` CRD will still remain, by default. 
+ +```console +$ kubectl get crd clusterpolicies.nvidia.com +``` + +To overcome this, the Operator uses a [post-delete hook](https://helm.sh/docs/topics/charts_hooks/#the-available-hooks) +to perform the CRD cleanup. +The `operator.cleanupCRD` chart parameter is added to enable this hook. +This parameter is disabled by default. +You can enable the hook by specifying `--set operator.cleanupCRD=true` during install or upgrade to perform automatic CRD cleanup on chart deletion. + +Alternatively, you can delete the custom resource definition: + +```console +$ kubectl delete crd clusterpolicies.nvidia.com +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/references/procedure.md b/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/references/procedure.md new file mode 100644 index 000000000..d58cff032 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-uninstalling-nvidia/references/procedure.md @@ -0,0 +1,59 @@ + + + +# Uninstall Procedure + +Perform the following steps to uninstall the Operator. + +1. Optional: List and delete NVIDIA driver custom resources. + + ```console + $ kubectl get nvidiadrivers + ``` + + *Example Output* + + ```output + NAME STATUS AGE + demo-gold ready 2023-10-16T17:57:12Z + demo-silver ready 2023-10-16T17:57:12Z + ``` + + ```console + $ kubectl delete nvidiadriver demo-gold + $ kubectl delete nvidiadriver demo-silver + ``` + + ```console + $ kubectl delete crd nvidiadrivers.nvidia.com + ``` + +1. Delete the Operator: + + ```console + $ helm delete -n gpu-operator $(helm list -n gpu-operator | grep gpu-operator | awk '{print $1}') + ``` + +1. Optional: List the pods in the Operator namespace to confirm the pods are deleted or in the process of deleting: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + *Example Output* + + ```output + No resources found. + ``` + +> [!NOTE] +> - After uninstalling the Operator, the NVIDIA driver modules might still be loaded. 
+> Either reboot the node or unload them using the following command: +> +> ```console +> $ sudo rmmod nvidia_modeset nvidia_uvm nvidia +> ``` +> +> - Helm hooks used with the GPU Operator use the Operator image itself. +> If the Operator image cannot be pulled successfully (either due to network error or an invalid NGC registry secret in case of NVAIE), hooks will fail. +> In this case, delete the chart and specify the `--no-hooks` argument to avoid hanging on hook failures. From 0a4a707275fc2d1aea6a92f2eb42f5bc1074057a Mon Sep 17 00:00:00 2001 From: Andrew Chen Date: Mon, 1 Jun 2026 21:51:49 -0700 Subject: [PATCH 10/13] refactor(skills): info-hiding restructure batch 2 (azure, nvaie, gov-ready) Restructure 3 procedural GPU Operator skills to the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines) with all step-by-step detail moved into phase-specific references/*.md. Skills: nvidia-azure (AKS approaches + preinstalled-driver install), install-nvidia-enterprise (concepts, vGPU-driver, NLS-token-update, data-center-driver), install-governmentready-environments (overview, install, Ubuntu-Pro-token-update). All three pass the close-your-eyes test. Verified-fix content (prerequisites, verification, placeholders, cross-refs) preserved. 
Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../SKILL.md | 173 +++--------------- .../references/install.md | 118 ++++++++++++ .../references/overview.md | 29 +++ .../references/update-ubuntu-pro-token.md | 16 ++ .../SKILL.md | 161 +++------------- .../references/concepts.md | 22 +++ .../references/datacenter-driver.md | 40 ++++ .../references/nls-token-update.md | 50 +++++ .../references/vgpu-driver.md | 40 ++++ .../skills/gpu-operator-nvidia-azure/SKILL.md | 128 +++---------- .../references/approaches.md | 56 ++++++ .../references/install-preinstalled.md | 59 ++++++ 12 files changed, 509 insertions(+), 383 deletions(-) create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/references/install.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/references/overview.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/references/update-ubuntu-pro-token.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/concepts.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/datacenter-driver.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/nls-token-update.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/vgpu-driver.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-azure/references/approaches.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-azure/references/install-preinstalled.md diff --git a/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md index 1e1c35459..4b9be6b7b 100644 --- 
a/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/SKILL.md @@ -20,162 +20,43 @@ tags: # NVIDIA GPU Operator Government Ready +Install the government-ready NVIDIA GPU Operator for NVIDIA AI Enterprise +customers deploying into FedRAMP High or equivalent sovereign environments. The +government-ready path uses STIG/FIPS driver images from NGC, an Ubuntu Pro token +for FIPS-kernel package access, and a privileged pod-security namespace policy. + ## Prerequisites - A running Kubernetes cluster with NVIDIA GPU worker nodes. - The `kubectl` and `helm` CLIs available on a client machine. - An NVIDIA AI Enterprise subscription. Government-ready components are available to NVIDIA AI Enterprise customers for FedRAMP High or equivalent sovereign use cases. -## Overview - -The NVIDIA GPU Operator now offers government-ready components for NVIDIA AI Enterprise customers. -Government ready is NVIDIA's designation for software that meets applicable security requirements for deployment in your FedRAMP High or equivalent sovereign use case. -For more information on NVIDIA's government-ready support, refer to the white paper [AI Software for Regulated Environments](https://docs.nvidia.com/ai-enterprise/planning-resource/ai-software-regulated-environments-white-paper/latest/index.html). - -## Supported GPU Operator Components - -Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) for a full list of supported government-ready GPU Operator components. - -Artifacts for these components are available from the [NVIDIA NGC Catalog](https://registry.ngc.nvidia.com/orgs/nvstaging/teams/cloud-native/containers/gpu-driver-stig-fips). - -> [!NOTE] -> Not all GPU Operator components and features are available as government-ready containers in this release. 
-> For example, NVIDIA GDS Driver, NVIDIA Confidential Computing Manager, and NVIDIA GDRCopy Driver are not yet supported. - -## Validated Kubernetes Distributions - -The government-ready NVIDIA GPU Operator has been validated on the following Kubernetes distributions: - -- Canonical Kubernetes 1.34 with Ubuntu Pro 24.04 and FIPS-compliant kernel -- Red Hat OpenShift 4.19 in FIPS mode -- Rancher Kubernetes Engine 2 with Ubuntu 24.04 -- VMware VKS with Ubuntu 24.04 - -## Install Government-Ready NVIDIA GPU Operator - -Once you have your gov-ready-prerequisites configured, use the following steps to install the NVIDIA GPU Operator on Canonical Kubernetes distributions: - -1. install-nfd -1. create-ngc-api-pull-secret -1. create-ubuntu-pro-token-secret -1. deploy-nvidia-gpu-operator-gov-ready - -> [!NOTE] -> For deployment on OpenShift, refer to [Install GPU Operator (government-ready) on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-operator-gov-ready-openshift.html). -### Prerequisites - -- An active NVIDIA AI Enterprise subscription and NGC API token to access GPU Operator government-ready containers. - Refer to [Generating Your NGC API Key](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#generating-api-key) in the NVIDIA NGC User Guide for more information on NGC API tokens. - -- An Ubuntu Pro token for Canonical Kubernetes deployments. - This token is required for the driver container to download kernel headers and other necessary packages from the Canonical repository when using the FIPS-enabled kernel on Ubuntu 24.04. - Refer to the [Ubuntu Pro documentation](https://documentation.ubuntu.com/pro-client/en/v30/howtoguides/get_token_and_attach/) for more information on accessing Ubuntu Pro tokens. - -- The `helm` CLI installed on a client machine. 
- - You can run the following commands to install the Helm CLI: - - ```console - $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \ - && chmod 700 get_helm.sh \ - && ./get_helm.sh - ``` - -- A namespace to deploy the NVIDIA GPU Operator. - The example install commands below use `gpu-operator` as the namespace. - -- Optionally, Service Mesh for intra-cluster traffic encryption. - By default, the NVIDIA GPU Operator does not encrypt traffic between its controller (and operands) and the Kubernetes API server. - If you wish to encrypt this communication, you should deploy and maintain a service mesh application within the Kubernetes cluster to enable secure traffic. - -### Install Node Feature Discovery (NFD) - -NFD is an open-source project that is a dependency for the Operator on each node in your cluster. -It must be deployed before installing the NVIDIA GPU Operator. - -GPU Operator does not maintain a government ready version of NFD, it is recommended that you install the upstream NFD version that aligns with the operator-component-matrix. -The NFD container is built on top of a scratch image, providing a highly secure container environment. -For information on NFD CVEs and security updates, refer to the [NFD GitHub repository](https://github.com/kubernetes-sigs/node-feature-discovery/security). - -Refer to the NFD documentation for [installation instructions](https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/index.html). - -### Create NGC API Pull Secret - -Add a Docker registry secret for downloading the GPU Operator artifacts from NVIDIA NGC in the same namespace where you are planning to deploy the NVIDIA GPU Operator. -Update `ngc-api-key` in the command below with your NGC API key. 
- -```console -$ kubectl create secret -n gpu-operator docker-registry ngc-secret \ - --docker-server=nvcr.io \ - --docker-username='$oauthtoken' \ - --docker-password= -``` - -### Create Ubuntu Pro Token Secret - -Create a Kubernetes secret to hold the value of your Ubuntu Pro token secret. -This secret will be used in the install command in the next step. - -The Ubuntu Pro Token is required for the driver container to download kernel headers and other necessary packages from the Canonical repository when using the FIPS-enabled kernel on Ubuntu 24.04. - -1. Get the Ubuntu Pro token: - - ```console - $ echo UBUNTU_PRO_TOKEN= > ubuntu-fips.env - ``` - - Replace `` with your actual Ubuntu Pro token. - -2. Create Ubuntu Pro token Secret: - - ```console - $ kubectl create secret generic ubuntu-fips-secret \ - --from-env-file=./ubuntu-fips.env --namespace gpu-operator - ``` - - Note that the namespace in the above command is `gpu-operator`. - Update this to the namespace you are planning to use for the NVIDIA GPU Operator. - -### Install NVIDIA GPU Operator Government-Ready Components - -1. Label your `gpu-operator` namespace for the Operator to set the enforcement policy to privilege. - - ```console - $ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged - ``` - -1. Add the NVIDIA Helm repository: - - ```console - $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ - && helm repo update - ``` - -1. Install the NVIDIA GPU Operator. 
+## Activation - ```console - $ helm install gpu-operator nvidia/gpu-operator \ - --namespace gpu-operator \ - --set driver.secretEnv=ubuntu-fips-secret \ - --set driver.repository=nvcr.io/nvidia \ - --set driver.version=580.95.05-stig-fips \ - --set driver.image=gpu-driver-stig-fips \ - --set driver.imagePullSecrets={ngc-secret} \ - --set nfd.enabled=false - ``` +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All command sequences, secret manifests, Helm `--set` values, and +verification steps live only in those reference files — do not improvise +commands from this dispatch layer. -Refer to [Common Chart Customization Options](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options) for more information about installation options. +## Phases -## Update Ubuntu Pro Token in ClusterPolicy +| Phase | Summary | Reference | +|-------|---------|-----------| +| Overview | What government-ready means (FedRAMP High / sovereign), which GPU Operator components are supported (and which are not yet), and the validated Kubernetes distributions. | [references/overview.md](references/overview.md) | +| Install | The full Canonical-Kubernetes install: detailed prerequisites (NGC token, Ubuntu Pro token, Helm, namespace, optional service mesh), install NFD, create the NGC pull secret + Ubuntu Pro token secret, label the namespace privileged, and Helm-install with the STIG/FIPS driver image. | [references/install.md](references/install.md) | +| Update Ubuntu Pro token | Rotate the Ubuntu Pro token post-install by editing the secret named in `driver.secretEnv`. | [references/update-ubuntu-pro-token.md](references/update-ubuntu-pro-token.md) | -You can update your Ubuntu Pro Token after installation by editing your Ubuntu Pro Token secret. 
-This secret name is set as value of `driver.secretEnv` of the GPU Operator ClusterPolicy. +## Hard rules (apply across all phases) -Edit your Ubuntu Pro Token secret. +- Government-ready components require an active NVIDIA AI Enterprise subscription and NGC API token. +- On Canonical Kubernetes with the FIPS kernel, an Ubuntu Pro token (as a Kubernetes secret) is required for the driver container to fetch kernel headers. +- Label the Operator namespace `pod-security.kubernetes.io/enforce=privileged` before installing. +- Install NFD (upstream version aligned to the component matrix) before the Operator; set `nfd.enabled=false` on the Operator install. +- For OpenShift, follow the dedicated government-ready OpenShift install guide instead. -```console -$ kubectl edit secrets -``` +## Verification -Then update the secret with your new Ubuntu Pro Token. -This token is required for the driver container to download kernel headers and other necessary packages from the Canonical repository when using the FIPS-enabled kernel on Ubuntu 24.04. +Confirm the Operator and STIG/FIPS driver operands deploy successfully in the +target namespace per [references/install.md](references/install.md); for +OpenShift use the linked OpenShift guide's verification. diff --git a/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/references/install.md b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/references/install.md new file mode 100644 index 000000000..79d726cfd --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/references/install.md @@ -0,0 +1,118 @@ + + + +# Install Government-Ready NVIDIA GPU Operator + +Once you have your prerequisites configured, use the following steps to install the NVIDIA GPU Operator on Canonical Kubernetes distributions: + +1. Install NFD +1. Create NGC API pull secret +1. Create Ubuntu Pro token secret +1. 
Deploy NVIDIA GPU Operator (government-ready) + +> [!NOTE] +> For deployment on OpenShift, refer to [Install GPU Operator (government-ready) on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/install-gpu-operator-gov-ready-openshift.html). + +## Prerequisites + +- An active NVIDIA AI Enterprise subscription and NGC API token to access GPU Operator government-ready containers. + Refer to [Generating Your NGC API Key](https://docs.nvidia.com/ngc/gpu-cloud/ngc-user-guide/index.html#generating-api-key) in the NVIDIA NGC User Guide for more information on NGC API tokens. + +- An Ubuntu Pro token for Canonical Kubernetes deployments. + This token is required for the driver container to download kernel headers and other necessary packages from the Canonical repository when using the FIPS-enabled kernel on Ubuntu 24.04. + Refer to the [Ubuntu Pro documentation](https://documentation.ubuntu.com/pro-client/en/v30/howtoguides/get_token_and_attach/) for more information on accessing Ubuntu Pro tokens. + +- The `helm` CLI installed on a client machine. + + You can run the following commands to install the Helm CLI: + + ```console + $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \ + && chmod 700 get_helm.sh \ + && ./get_helm.sh + ``` + +- A namespace to deploy the NVIDIA GPU Operator. + The example install commands below use `gpu-operator` as the namespace. + +- Optionally, Service Mesh for intra-cluster traffic encryption. + By default, the NVIDIA GPU Operator does not encrypt traffic between its controller (and operands) and the Kubernetes API server. + If you wish to encrypt this communication, you should deploy and maintain a service mesh application within the Kubernetes cluster to enable secure traffic. + +## Install Node Feature Discovery (NFD) + +NFD is an open-source project that is a dependency for the Operator on each node in your cluster. 
+It must be deployed before installing the NVIDIA GPU Operator. + +The GPU Operator does not maintain a government-ready version of NFD, so install the upstream NFD version that aligns with the GPU Operator Component Matrix. +The NFD container is built on top of a scratch image, providing a highly secure container environment. +For information on NFD CVEs and security updates, refer to the [NFD GitHub repository](https://github.com/kubernetes-sigs/node-feature-discovery/security). + +Refer to the NFD documentation for [installation instructions](https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/index.html). + +## Create NGC API Pull Secret + +Add a Docker registry secret for downloading the GPU Operator artifacts from NVIDIA NGC in the same namespace where you are planning to deploy the NVIDIA GPU Operator. +Update `ngc-api-key` in the command below with your NGC API key. + +```console +$ kubectl create secret -n gpu-operator docker-registry ngc-secret \ + --docker-server=nvcr.io \ + --docker-username='$oauthtoken' \ + --docker-password= +``` + +## Create Ubuntu Pro Token Secret + +Create a Kubernetes secret to hold your Ubuntu Pro token. +This secret will be used in the install command in the next step. + +The Ubuntu Pro Token is required for the driver container to download kernel headers and other necessary packages from the Canonical repository when using the FIPS-enabled kernel on Ubuntu 24.04. + +1. Write the Ubuntu Pro token to an environment file: + + ```console + $ echo UBUNTU_PRO_TOKEN= > ubuntu-fips.env + ``` + + Replace `` with your actual Ubuntu Pro token. + +2. Create the Ubuntu Pro token secret: + + ```console + $ kubectl create secret generic ubuntu-fips-secret \ + --from-env-file=./ubuntu-fips.env --namespace gpu-operator + ``` + + Note that the namespace in the above command is `gpu-operator`. + Update this to the namespace you are planning to use for the NVIDIA GPU Operator.
+ +## Install NVIDIA GPU Operator Government-Ready Components + +1. Label the `gpu-operator` namespace to set the pod security enforcement policy to `privileged`: + + ```console + $ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged + ``` + +1. Add the NVIDIA Helm repository: + + ```console + $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ + && helm repo update + ``` + +1. Install the NVIDIA GPU Operator: + + ```console + $ helm install gpu-operator nvidia/gpu-operator \ + --namespace gpu-operator \ + --set driver.secretEnv=ubuntu-fips-secret \ + --set driver.repository=nvcr.io/nvidia \ + --set driver.version=580.95.05-stig-fips \ + --set driver.image=gpu-driver-stig-fips \ + --set driver.imagePullSecrets={ngc-secret} \ + --set nfd.enabled=false + ``` + +Refer to [Common Chart Customization Options](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options) for more information about installation options. diff --git a/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/references/overview.md b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/references/overview.md new file mode 100644 index 000000000..5b22f5270 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/references/overview.md @@ -0,0 +1,29 @@ + + + +# Overview, Supported Components, and Validated Distributions + +## Overview + +The NVIDIA GPU Operator now offers government-ready components for NVIDIA AI Enterprise customers. +Government ready is NVIDIA's designation for software that meets applicable security requirements for deployment in your FedRAMP High or equivalent sovereign use case.
+For more information on NVIDIA's government-ready support, refer to the white paper [AI Software for Regulated Environments](https://docs.nvidia.com/ai-enterprise/planning-resource/ai-software-regulated-environments-white-paper/latest/index.html). + +## Supported GPU Operator Components + +Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) for a full list of supported government-ready GPU Operator components. + +Artifacts for these components are available from the [NVIDIA NGC Catalog](https://registry.ngc.nvidia.com/orgs/nvstaging/teams/cloud-native/containers/gpu-driver-stig-fips). + +> [!NOTE] +> Not all GPU Operator components and features are available as government-ready containers in this release. +> For example, NVIDIA GDS Driver, NVIDIA Confidential Computing Manager, and NVIDIA GDRCopy Driver are not yet supported. + +## Validated Kubernetes Distributions + +The government-ready NVIDIA GPU Operator has been validated on the following Kubernetes distributions: + +- Canonical Kubernetes 1.34 with Ubuntu Pro 24.04 and FIPS-compliant kernel +- Red Hat OpenShift 4.19 in FIPS mode +- Rancher Kubernetes Engine 2 with Ubuntu 24.04 +- VMware VKS with Ubuntu 24.04 diff --git a/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/references/update-ubuntu-pro-token.md b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/references/update-ubuntu-pro-token.md new file mode 100644 index 000000000..a600abf5a --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-governmentready-environments/references/update-ubuntu-pro-token.md @@ -0,0 +1,16 @@ + + + +# Update Ubuntu Pro Token in ClusterPolicy + +You can update your Ubuntu Pro Token after installation by editing your Ubuntu Pro Token secret. +The name of this secret is the value of `driver.secretEnv` in the GPU Operator ClusterPolicy.
+ +Edit your Ubuntu Pro Token secret. + +```console +$ kubectl edit secrets +``` + +Then update the secret with your new Ubuntu Pro Token. +This token is required for the driver container to download kernel headers and other necessary packages from the Canonical repository when using the FIPS-enabled kernel on Ubuntu 24.04. diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/SKILL.md index 13eb70672..f1af4a093 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/SKILL.md @@ -20,151 +20,44 @@ tags: # NVIDIA AI Enterprise +Install the GPU Operator with NVIDIA AI Enterprise. There are two installation +paths: the **vGPU guest driver** (required on virtualization platforms; uses a +prebuilt licensed image and an NGC-hosted Bash installer script with NVIDIA +License System tokens) and the **data center driver** (bare-metal / non-virtualized; +public Helm chart and driver containers matched to your release's driver branch). + ## Prerequisites - A running Kubernetes cluster with NVIDIA GPU worker nodes. - The `kubectl` and `helm` CLIs available on a client machine. - An NVIDIA AI Enterprise subscription with access to the NVIDIA Enterprise Catalog (NGC) and an NGC API key for the private registry. -## About NVIDIA AI Enterprise and Supported Platforms - -NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software, optimized, certified, and supported by NVIDIA with NVIDIA-Certified Systems. - -Deploying the GPU Operator with NVIDIA AI Enterprise offers two installation options. - -| vGPU Guest Driver | Data Center Driver | -| --- | --- | -| Uses a a prebuilt vGPU driver image that is only available to NVIDIA AI Enterprise customers. 
It is configured to use the [NVIDIA License System (NLS)](https://docs.nvidia.com/license-system/latest/). Installations on virtualization platforms must use the vGPU driver installation. Installation is performed by downloading a Bash script from NVIDIA NGC and running the script. | Uses the GPU Operator Helm chart that is publicly available and GPU driver containers that are publicly available. You must determine the supported driver branch, such as 550, for your NVIDIA AI Enterprise release. Installation is performed by running the `helm` command. | -For information about supported platforms, hypervisors, and operating systems, refer to the -[Product Support Matrix](https://docs.nvidia.com/ai-enterprise/latest/product-support-matrix/index.html) -in the NVIDIA AI Enterprise documentation. - -For information about using vGPU with Red Hat OpenShift, refer to [NVIDIA AI Enterprise with OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/nvaie-with-ocp.html). - -## Installing GPU Operator Using the vGPU Driver - -### Prerequisites - -- A client configuration token has been generated for the client on which the script will install the vGPU guest driver. - Refer to [Generating a Client Configuration Token](https://docs.nvidia.com/license-system/latest/nvidia-license-system-user-guide/index.html#generating-client-configuration-token) - in the *NVIDIA License System User Guide* for more information. -- An NGC CLI API key that is used to create an image pull secret. - The secret is used to pull the prebuilt vGPU driver image from NVIDIA NGC. - Refer to [Generating Your NGC API Key](https://docs.nvidia.com/ngc/latest/ngc-private-registry-user-guide.html#prug-generating-personal-api-key) - in the *NVIDIA NGC Private Registry User Guide* for more information. - -### Procedure - -1. Export the NGC CLI API key and your email address as environment variables: - - ```console - $ export NGC_API_KEY="M2Vub3QxYmgyZ..." 
- $ export NGC_USER_EMAIL="user@example.com" - ``` - -1. Go to the - [NVIDIA GPU Operator - Deploy Installer Script](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/vgpu/resources/gpu-operator-installer-5) - web page on NVIDIA NGC. - - Click the **File Browser** tab, identify your NVIDIA AI Enterprise release, click ellipses-img, and select **Download File**. - - Copy the downloaded script to the same directory as the client configuration token. - -1. Rename the client configuration token that you downloaded to `client_configuration_token.tok`. - Originally, the client configuration token is named to match the pattern: `client_configuration_token_mm-dd-yyyy-hh-mm-ss.tok`. - -1. From the directory that contains the downloaded script and the client configuration token, run the script: - - ```console - $ bash gpu-operator-nvaie.sh install - ``` - -## Updating NLS Client License Token - -In case the NLS client license token needs to be updated, use the following procedure: - -Create an empty vGPU license configuration file: - -```console -$ sudo touch gridd.conf -``` - -Generate and download a new NLS client license token. Refer to Section 4.6 of the [NLS User Guide](https://docs.nvidia.com/license-system/latest/pdf/nvidia-license-system-user-guide.pdf) for instructions. +## Activation -Rename the NLS client license token that you downloaded to `client_configuration_token.tok`. +Do this first: pick the installation path (and any token-update task) matching +your platform from the Phases table below, then **read the corresponding +`references/.md` file before acting**. All command sequences, manifest +edits, and verification output live only in those reference files — do not +improvise commands from this dispatch layer. -> [!WARNING] -> The `configMap(configMapName)` is **deprecated** and will be removed in a future release. -> Use `secrets(secretName)` instead. 
-> Create a new `licensing-config-new` Secret object in the `gpu-operator` namespace (make sure the name of the secret is not already used in the kubernetes cluster). Both the vGPU license configuration file and the NLS client license token will be added to this Secret: +## Phases -```console -$ kubectl create secret generic licensing-config-new \ - -n gpu-operator --from-file=gridd.conf --from-file=/client_configuration_token.tok -``` +| Phase | Summary | Reference | +|-------|---------|-----------| +| Concepts | What NVIDIA AI Enterprise is, the vGPU-guest-driver vs data-center-driver decision table, and where to check the platform support matrix. | [references/concepts.md](references/concepts.md) | +| vGPU driver install | For virtualization platforms: prerequisites (client config token, NGC API key), export env vars, download the NGC installer script, rename the token, and run `gpu-operator-nvaie.sh install`. | [references/vgpu-driver.md](references/vgpu-driver.md) | +| NLS token update | Rotate the NLS client license token: create `gridd.conf`, build a `licensing-config-new` Secret, and repoint `licensingConfig.secretName` in the cluster policy. | [references/nls-token-update.md](references/nls-token-update.md) | +| Data center driver install | For bare-metal/non-virtualized: identify the supported driver branch + matching GPU Operator version, then install via the `gpu-operator-install` skill with `--version=`; verify licensing. 
| [references/datacenter-driver.md](references/datacenter-driver.md) | -Edit the clusterpolicies by using the command: +## Hard rules (apply across all phases) -```console -$ kubectl edit clusterpolicies.nvidia.com -``` - -Go to the driver section and replace the following argument: - -```console -licensingConfig: - secretName: licensing-config -``` - -with - -```console -licensingConfig: - secretName: licensing-config-new -``` - -Write and exit from the kubectl edit session (you can use :qw for instance if vi utility is used) - -GPU Operator sequentially redeploys all the driver pods with this new licensing information. - -## Installing GPU Operator Using the Data Center Driver - -This installation method is available for bare metal clusters or any cluster that does not use virtualization. - -You must install the driver that matches the supported driver branch for your NVIDIA AI Enterprise release. - -To identify the correct driver branch: - -1. Refer to the [NVIDIA AI Enterprise Infra Release Branches](https://docs.nvidia.com/ai-enterprise/index.html#nvidiatab-infrastructure-software---infra-release-branches) - table to determine the driver branch for your release. - - For example, NVIDIA AI Enterprise Infra 7.x uses the R580 driver branch. - -1. Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) to identify the recommended GPU Operator version and driver version that uses the same driver branch. - -After identifying the correct driver version, use the `gpu-operator-install` skill for installation instructions. -Use the `--version=` argument when installing with Helm. +- Installations on virtualization platforms must use the vGPU driver path; the data center driver path is for bare-metal / non-virtualized clusters only. +- The vGPU path requires a valid NLS client configuration token renamed to `client_configuration_token.tok`. 
+- Prefer `secrets(secretName)` for licensing config; `configMap(configMapName)` is deprecated. +- Match the driver branch to your NVIDIA AI Enterprise release per the component matrix; never hardcode an arbitrary version. ## Verification -Confirm that the Operator installed with the NVIDIA AI Enterprise components and that licensing succeeded: - -1. Confirm the Operator pods are running: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - The driver pods should report `Running` and the `nvidia-operator-validator` pod should report `Completed`. - -1. Confirm the driver acquired a valid license: - - ```console - $ kubectl exec -it -n gpu-operator -- nvidia-smi -q | grep -i "License Status" - ``` - - The license status should report `Licensed`. - -## Related Information - -- [NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/) web page. +Confirm driver pods are `Running`, `nvidia-operator-validator` is `Completed`, +and `nvidia-smi -q | grep "License Status"` reports `Licensed`. Exact commands +are in [references/datacenter-driver.md](references/datacenter-driver.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/concepts.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/concepts.md new file mode 100644 index 000000000..d1712fb92 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/concepts.md @@ -0,0 +1,22 @@ + + + +# About NVIDIA AI Enterprise and Supported Platforms + +NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software, optimized, certified, and supported by NVIDIA with NVIDIA-Certified Systems. + +Deploying the GPU Operator with NVIDIA AI Enterprise offers two installation options. + +| vGPU Guest Driver | Data Center Driver | +| --- | --- | +| Uses a a prebuilt vGPU driver image that is only available to NVIDIA AI Enterprise customers. 
It is configured to use the [NVIDIA License System (NLS)](https://docs.nvidia.com/license-system/latest/). Installations on virtualization platforms must use the vGPU driver installation. Installation is performed by downloading a Bash script from NVIDIA NGC and running the script. | Uses the GPU Operator Helm chart that is publicly available and GPU driver containers that are publicly available. You must determine the supported driver branch, such as 550, for your NVIDIA AI Enterprise release. Installation is performed by running the `helm` command. | + +For information about supported platforms, hypervisors, and operating systems, refer to the +[Product Support Matrix](https://docs.nvidia.com/ai-enterprise/latest/product-support-matrix/index.html) +in the NVIDIA AI Enterprise documentation. + +For information about using vGPU with Red Hat OpenShift, refer to [NVIDIA AI Enterprise with OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/nvaie-with-ocp.html). + +## Related Information + +- [NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/) web page. diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/datacenter-driver.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/datacenter-driver.md new file mode 100644 index 000000000..21ff0145f --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/datacenter-driver.md @@ -0,0 +1,40 @@ + + + +# Installing GPU Operator Using the Data Center Driver + +This installation method is available for bare metal clusters or any cluster that does not use virtualization. + +You must install the driver that matches the supported driver branch for your NVIDIA AI Enterprise release. + +To identify the correct driver branch: + +1. 
Refer to the [NVIDIA AI Enterprise Infra Release Branches](https://docs.nvidia.com/ai-enterprise/index.html#nvidiatab-infrastructure-software---infra-release-branches) + table to determine the driver branch for your release. + + For example, NVIDIA AI Enterprise Infra 7.x uses the R580 driver branch. + +1. Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) to identify the recommended GPU Operator version and driver version that uses the same driver branch. + +After identifying the correct driver version, use the `gpu-operator-install` skill for installation instructions. +Use the `--version=` argument when installing with Helm. + +## Verification + +Confirm that the Operator installed with the NVIDIA AI Enterprise components and that licensing succeeded: + +1. Confirm the Operator pods are running: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + The driver pods should report `Running` and the `nvidia-operator-validator` pod should report `Completed`. + +1. Confirm the driver acquired a valid license: + + ```console + $ kubectl exec -it -n gpu-operator -- nvidia-smi -q | grep -i "License Status" + ``` + + The license status should report `Licensed`. diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/nls-token-update.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/nls-token-update.md new file mode 100644 index 000000000..4d4ae31ef --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/nls-token-update.md @@ -0,0 +1,50 @@ + + + +# Updating NLS Client License Token + +In case the NLS client license token needs to be updated, use the following procedure: + +Create an empty vGPU license configuration file: + +```console +$ sudo touch gridd.conf +``` + +Generate and download a new NLS client license token. 
Refer to Section 4.6 of the [NLS User Guide](https://docs.nvidia.com/license-system/latest/pdf/nvidia-license-system-user-guide.pdf) for instructions. + +Rename the NLS client license token that you downloaded to `client_configuration_token.tok`. + +> [!WARNING] +> The `configMap(configMapName)` is **deprecated** and will be removed in a future release. +> Use `secrets(secretName)` instead. +> Create a new `licensing-config-new` Secret object in the `gpu-operator` namespace (make sure that the Secret name is not already used in the Kubernetes cluster). Both the vGPU license configuration file and the NLS client license token will be added to this Secret: + +```console +$ kubectl create secret generic licensing-config-new \ + -n gpu-operator --from-file=gridd.conf --from-file=/client_configuration_token.tok +``` + +Edit the cluster policy by using the following command: + +```console +$ kubectl edit clusterpolicies.nvidia.com +``` + +Go to the driver section and replace the following argument: + +```console +licensingConfig: + secretName: licensing-config +``` + +with + +```console +licensingConfig: + secretName: licensing-config-new +``` + +Save your changes and exit the kubectl edit session (for example, by using `:wq` if the editor is vi). + +GPU Operator sequentially redeploys all the driver pods with this new licensing information. diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/vgpu-driver.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/vgpu-driver.md new file mode 100644 index 000000000..03c475a19 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-enterprise/references/vgpu-driver.md @@ -0,0 +1,40 @@ + + + +# Installing GPU Operator Using the vGPU Driver + +## Prerequisites + +- A client configuration token has been generated for the client on which the script will install the vGPU guest driver.
+ Refer to [Generating a Client Configuration Token](https://docs.nvidia.com/license-system/latest/nvidia-license-system-user-guide/index.html#generating-client-configuration-token) + in the *NVIDIA License System User Guide* for more information. +- An NGC CLI API key that is used to create an image pull secret. + The secret is used to pull the prebuilt vGPU driver image from NVIDIA NGC. + Refer to [Generating Your NGC API Key](https://docs.nvidia.com/ngc/latest/ngc-private-registry-user-guide.html#prug-generating-personal-api-key) + in the *NVIDIA NGC Private Registry User Guide* for more information. + +## Procedure + +1. Export the NGC CLI API key and your email address as environment variables: + + ```console + $ export NGC_API_KEY="M2Vub3QxYmgyZ..." + $ export NGC_USER_EMAIL="user@example.com" + ``` + +1. Go to the + [NVIDIA GPU Operator - Deploy Installer Script](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/vgpu/resources/gpu-operator-installer-5) + web page on NVIDIA NGC. + + Click the **File Browser** tab, identify your NVIDIA AI Enterprise release, click the ellipsis menu icon for that release, and select **Download File**. + + Copy the downloaded script to the same directory as the client configuration token. + +1. Rename the client configuration token that you downloaded to `client_configuration_token.tok`. + The downloaded token file name initially matches the pattern `client_configuration_token_mm-dd-yyyy-hh-mm-ss.tok`. + +1.
From the directory that contains the downloaded script and the client configuration token, run the script: + + ```console + $ bash gpu-operator-nvaie.sh install + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md index a2f4d33a7..cc6addf8f 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/SKILL.md @@ -20,119 +20,41 @@ tags: # NVIDIA GPU Operator with Azure Kubernetes Service +Deploy the NVIDIA GPU Operator on Azure AKS. AKS GPU images ship with a +preinstalled NVIDIA driver and Container Toolkit, so the right approach depends +on whether you create the node pool with `--skip-gpu-driver-install` (Operator +manages the full lifecycle) or run on the default image (Operator runs with +driver/toolkit deployment disabled). + ## Prerequisites - An Azure subscription and the Azure CLI (`az`) installed and configured. - The `kubectl` and `helm` CLIs available on a client machine. - An AKS cluster with a GPU-enabled node pool that uses a supported operating system. Use a node pool created with `--skip-gpu-driver-install` so that the GPU Operator manages the driver lifecycle. -## Approaches for Working with Azure AKS - -### Create AKS Cluster with a Node Pool to Skip GPU Driver installation - -Azure Kubernetes Service has a preview feature that enables a `--skip-gpu-driver-install` -command-line argument to the `az aks nodepool add` command. -This argument prevents installing -the NVIDIA GPU Driver in the stock Ubuntu operating system. - -This approach enables you to take advantage of the lifecycle management -that the NVIDIA GPU Operator provides for managing your cluster. - -```console -$ az aks nodepool add --resource-group --name gpunodes --cluster-name \ - --node-count \ - --skip-gpu-driver-install \ - ... 
-``` - -When you follow this approach, you can install the Operator without any special -considerations or arguments. -Refer to Install NVIDIA GPU Operator. - -For more information about this feature, see -[Skip GPU driver installation](https://learn.microsoft.com/en-us/azure/aks/use-nvidia-gpu?source=recommendations&tabs=add-ubuntu-gpu-node-pool#skip-gpu-driver-installation) -in the Azure Kubernetes Service documentation. - -### Default AKS configuration without the GPU Operator - -By default, you can run Azure AKS images on GPU-enabled virtual machines with NVIDIA GPUs, -and not use the NVIDIA GPU Operator. - -AKS images include a preinstalled NVIDIA GPU Driver and preinstalled NVIDIA Container Toolkit. - -Using the default configuration, without the Operator, has the following limitations: - -* Metrics are not collected or reported with NVIDIA DCGM Exporter. -* Validating the container runtime is manual rather than automatic with the Operator. -* Multi-Instance GPU (MIG) profiles must be set when you create the node pool and you - cannot change the profile at run time. - -If these limitations are acceptable to you, refer to -[Use GPUs for compute-intensive workloads on Azure Kubernetes Services](https://learn.microsoft.com/en-us/azure/aks/gpu-cluster) -in the Microsoft Azure product documentation for information about configuring your cluster. - -### GPU Operator with Preinstalled Driver and Container Toolkit - -The images that are available in AKS always include a preinstalled NVIDIA GPU driver -and a preinstalled NVIDIA Container Toolkit. -These images reduce the primary benefit of installing the Operator so that it can -manage the lifecycle of these software components and others. - -However, using the Operator can overcome the limitations identified in the preceding section. 
- -## Installing the Operator for Preinstalled Driver and Toolkit - -After you start your Azure AKS cluster with an image that includes a preinstalled NVIDIA GPU Driver -and NVIDIA Container Toolkit, you are ready to install the NVIDIA GPU Operator. - -When you install the Operator, you must prevent the Operator from automatically -deploying NVIDIA Driver Containers and the NVIDIA Container Toolkit. - -1. Add the NVIDIA Helm repository: - - ```console - $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ - && helm repo update - ``` - -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). - -1. Install the Operator without the driver containers and toolkit: - - ```console - $ helm install gpu-operator nvidia/gpu-operator \ - -n gpu-operator --create-namespace \ - --version= \ - --set driver.enabled=false \ - --set toolkit.enabled=false \ - --set operator.runtimeClass=nvidia-container-runtime - ``` - - Refer to Common Chart Customization Options for more information about installation options. +## Activation - *Example Output* +Do this first: pick the approach matching your AKS node pool from the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All command sequences and expected output live only in those +reference files — do not improvise commands from this dispatch layer. - ```output - NAME: gpu-operator - LAST DEPLOYED: Fri May 5 15:30:05 2023 - NAMESPACE: gpu-operator - STATUS: deployed - REVISION: 1 - TEST SUITE: None - ``` +## Phases - The Operator requires several minutes to install. +| Phase | Summary | Reference | +|-------|---------|-----------| +| Approaches | The three AKS options: `--skip-gpu-driver-install` node pool (Operator manages everything), default AKS without the Operator (and its DCGM/validation/MIG limitations), and Operator-with-preinstalled-driver-and-toolkit. 
| [references/approaches.md](references/approaches.md) | +| Install (preinstalled driver/toolkit) | Add the NVIDIA Helm repo and install the Operator with `driver.enabled=false`, `toolkit.enabled=false`, and `operator.runtimeClass=nvidia-container-runtime`; confirm the CUDA validator completes. | [references/install-preinstalled.md](references/install-preinstalled.md) | -1. Confirm that the Operator is installed and ran the CUDA validation container to completion: +## Hard rules (apply across all phases) - ```console - $ kubectl get pods -n gpu-operator -l app=nvidia-cuda-validator - ``` +- On default AKS images (driver + toolkit preinstalled), install the Operator with `driver.enabled=false` and `toolkit.enabled=false` so it does not redeploy those components. +- For full lifecycle management, create the node pool with `--skip-gpu-driver-install` and install the Operator normally (no special flags). +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. - *Example Output* +## Verification - ```output - NAME READY STATUS RESTARTS AGE - nvidia-cuda-validator-bpvkt 0/1 Completed 0 3m56s - ``` +Confirm the Operator ran the CUDA validation container to completion via +`kubectl get pods -n gpu-operator -l app=nvidia-cuda-validator` (expect +`Completed`). Exact commands are in +[references/install-preinstalled.md](references/install-preinstalled.md). 
diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/references/approaches.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/references/approaches.md new file mode 100644 index 000000000..03f3bf6e5 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/references/approaches.md @@ -0,0 +1,56 @@ + + + +# Approaches for Working with Azure AKS + +## Create AKS Cluster with a Node Pool to Skip GPU Driver installation + +Azure Kubernetes Service has a preview feature that enables a `--skip-gpu-driver-install` +command-line argument to the `az aks nodepool add` command. +This argument prevents installing +the NVIDIA GPU Driver in the stock Ubuntu operating system. + +This approach enables you to take advantage of the lifecycle management +that the NVIDIA GPU Operator provides for managing your cluster. + +```console +$ az aks nodepool add --resource-group --name gpunodes --cluster-name \ + --node-count \ + --skip-gpu-driver-install \ + ... +``` + +When you follow this approach, you can install the Operator without any special +considerations or arguments. +Refer to Install NVIDIA GPU Operator. + +For more information about this feature, see +[Skip GPU driver installation](https://learn.microsoft.com/en-us/azure/aks/use-nvidia-gpu?source=recommendations&tabs=add-ubuntu-gpu-node-pool#skip-gpu-driver-installation) +in the Azure Kubernetes Service documentation. + +## Default AKS configuration without the GPU Operator + +By default, you can run Azure AKS images on GPU-enabled virtual machines with NVIDIA GPUs, +and not use the NVIDIA GPU Operator. + +AKS images include a preinstalled NVIDIA GPU Driver and preinstalled NVIDIA Container Toolkit. + +Using the default configuration, without the Operator, has the following limitations: + +* Metrics are not collected or reported with NVIDIA DCGM Exporter. +* Validating the container runtime is manual rather than automatic with the Operator. 
+* Multi-Instance GPU (MIG) profiles must be set when you create the node pool and you + cannot change the profile at run time. + +If these limitations are acceptable to you, refer to +[Use GPUs for compute-intensive workloads on Azure Kubernetes Services](https://learn.microsoft.com/en-us/azure/aks/gpu-cluster) +in the Microsoft Azure product documentation for information about configuring your cluster. + +## GPU Operator with Preinstalled Driver and Container Toolkit + +The images that are available in AKS always include a preinstalled NVIDIA GPU driver +and a preinstalled NVIDIA Container Toolkit. +These images reduce the primary benefit of installing the Operator so that it can +manage the lifecycle of these software components and others. + +However, using the Operator can overcome the limitations identified in the preceding section. diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/references/install-preinstalled.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/references/install-preinstalled.md new file mode 100644 index 000000000..d1c5b5309 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-azure/references/install-preinstalled.md @@ -0,0 +1,59 @@ + + + +# Installing the Operator for Preinstalled Driver and Toolkit + +After you start your Azure AKS cluster with an image that includes a preinstalled NVIDIA GPU Driver +and NVIDIA Container Toolkit, you are ready to install the NVIDIA GPU Operator. + +When you install the Operator, you must prevent the Operator from automatically +deploying NVIDIA Driver Containers and the NVIDIA Container Toolkit. + +1. Add the NVIDIA Helm repository: + + ```console + $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ + && helm repo update + ``` + + > [!NOTE] + > Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +1. 
Install the Operator without the driver containers and toolkit: + + ```console + $ helm install gpu-operator nvidia/gpu-operator \ + -n gpu-operator --create-namespace \ + --version= \ + --set driver.enabled=false \ + --set toolkit.enabled=false \ + --set operator.runtimeClass=nvidia-container-runtime + ``` + + Refer to Common Chart Customization Options for more information about installation options. + + *Example Output* + + ```output + NAME: gpu-operator + LAST DEPLOYED: Fri May 5 15:30:05 2023 + NAMESPACE: gpu-operator + STATUS: deployed + REVISION: 1 + TEST SUITE: None + ``` + + The Operator requires several minutes to install. + +1. Confirm that the Operator is installed and ran the CUDA validation container to completion: + + ```console + $ kubectl get pods -n gpu-operator -l app=nvidia-cuda-validator + ``` + + *Example Output* + + ```output + NAME READY STATUS RESTARTS AGE + nvidia-cuda-validator-bpvkt 0/1 Completed 0 3m56s + ``` From 4f233dc6f45684cee752e956089521eac8f449bd Mon Sep 17 00:00:00 2001 From: Andrew Chen Date: Mon, 1 Jun 2026 21:55:03 -0700 Subject: [PATCH 11/13] refactor(skills): info-hiding restructure batch 3 (container-device, EKS, GKE) Restructure 3 procedural GPU Operator skills to the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines) with all step-by-step detail moved into phase-specific references/*.md. Skills: container-device (CDI, NRI, verification), nvidia-amazon (approaches, eksctl-example), nvidia-google (approaches, google-driver-installer, nvidia-driver-manager). All three pass the close-your-eyes test. Verified-fix content preserved ( placeholders, cross-refs to gpu-operator-install/-references skills, verification sections). 
Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../gpu-operator-container-device/SKILL.md | 214 ++------------ .../references/cdi.md | 111 ++++++++ .../references/nri.md | 74 +++++ .../references/verification.md | 23 ++ .../gpu-operator-nvidia-amazon/SKILL.md | 218 ++------------ .../references/approaches.md | 88 ++++++ .../references/eksctl-example.md | 119 ++++++++ .../gpu-operator-nvidia-google/SKILL.md | 269 ++---------------- .../references/approaches.md | 29 ++ .../references/google-driver-installer.md | 117 ++++++++ .../references/nvidia-driver-manager.md | 112 ++++++++ 11 files changed, 748 insertions(+), 626 deletions(-) create mode 100644 gpu-operator/.agents/skills/gpu-operator-container-device/references/cdi.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-container-device/references/nri.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-container-device/references/verification.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/references/approaches.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/references/eksctl-example.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-google/references/approaches.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-google/references/google-driver-installer.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-google/references/nvidia-driver-manager.md diff --git a/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md index 39b39a59d..667ce2298 100644 --- a/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-container-device/SKILL.md @@ -22,7 +22,11 @@ tags: # Container Device Interface (CDI) and Node Resource Interface (NRI) Plugin Support -This page gives an overview of CDI and NRI Plugin support in the GPU 
Operator. +Configure how the GPU Operator injects GPUs into containers. The **Container +Device Interface (CDI)** is the default, runtime-agnostic injection mechanism +(default-on since GPU Operator v25.10.0). The **Node Resource Interface (NRI) +Plugin** is an optional containerd extension that injects GPUs into GPU +management containers without requiring `runtimeClassName: nvidia`. ## Prerequisites @@ -30,199 +34,31 @@ This page gives an overview of CDI and NRI Plugin support in the GPU Operator. - The NVIDIA GPU Operator installed (use the `gpu-operator-install` skill). - A container runtime that supports CDI. CDI is enabled by default starting with GPU Operator v25.10.0. The NRI Plugin requires containerd v1.7.30, v2.1.x, or v2.2.x and is not supported with CRI-O. -## About Container Device Interface (CDI) +## Activation -The [Container Device Interface (CDI)](https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md) -is an open specification for container runtimes that abstracts what access to a device, such as an NVIDIA GPU, means, -and standardizes access across container runtimes. Popular container runtimes can read and process the specification to -ensure that a device is available in a container. CDI simplifies adding support for devices such as NVIDIA GPUs because -the specification is applicable to all container runtimes that support CDI. +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All `kubectl patch` cluster-policy commands, Helm flags, and +verification output live only in those reference files — do not improvise +commands from this dispatch layer. -Starting with GPU Operator v25.10.0, CDI is used by default for enabling GPU support in containers running on Kubernetes. -Specifically, CDI support in container runtimes, like containerd and cri-o, is used to inject GPU(s) into workload -containers. 
This differs from prior GPU Operator releases where CDI was used via a CDI-enabled `nvidia` runtime class. +## Phases -If you are upgrading from a version of the GPU Operator prior to v25.10.0, where CDI was disabled by default, and you are upgrading to v25.10.0 or later, where CDI is enabled by default, no configuration changes are required for standard workloads using GPU allocation through the Device Plugin. -For workloads that already have `runtimeClassName: nvidia` set in their pod spec YAML, no change is necessary. +| Phase | Summary | Reference | +|-------|---------|-----------| +| CDI | What CDI is, its interaction with GPU Management Containers and `NVIDIA_VISIBLE_DEVICES`, and how to enable CDI after install or disable it (including the CRI-O validator toggle). | [references/cdi.md](references/cdi.md) | +| NRI Plugin | What the NRI Plugin is, its containerd requirements, how it removes the need for the `nvidia` runtime class / containerd config edits, and how to enable/disable it (install flag `cdi.nriPluginEnabled=true` or cluster-policy patch). | [references/nri.md](references/nri.md) | +| Verification | Confirm the toolkit and device-plugin daemonsets are `Running` and run a CUDA sample that reports `Test PASSED`. | [references/verification.md](references/verification.md) | -Use of CDI is transparent to cluster administrators and application developers. -The benefits of CDI are largely to reduce development and support for runtime-specific -plugins. +## Hard rules (apply across all phases) -### CDI and GPU Management Containers - -When CDI is enabled in GPU Operator versions v25.10.0 and later, GPU Management Containers that use the `NVIDIA_VISIBLE_DEVICES` environment variable to get GPU access, bypassing GPU allocation via the Device Plugin or DRA Driver for GPUs, must set `runtimeClassName: nvidia` in the pod specification. -A GPU Management Container is a container that requires access to all GPUs without them being allocated by Kubernetes. 
-Examples of GPU Management Containers include monitoring agents and device plugins. - -It is recommended that `NVIDIA_VISIBLE_DEVICES` only be used by GPU Management Containers. - -> [!NOTE] -> Setting `runtimeClassName: nvidia` in the pod specification is not required when the NRI Plugin is enabled in GPU Operator. -> Refer to About the Node Resource Interface (NRI) Plugin. - -## Enabling CDI - -CDI is enabled by default during installation in GPU Operator v25.10.0 and later. -Follow the instructions for installing the Operator with Helm on the getting-started page. - -CDI is also enabled by default during a Helm upgrade to GPU Operator v25.10.0 and later. - -### Enabling CDI After Installation - -CDI is enabled by default in GPU Operator v25.10.0 and later. -Use the following procedure to enable CDI if you disabled CDI during installation. - -### Procedure -1. Enable CDI by modifying the cluster policy: - - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ - -p='[{"op": "replace", "path": "/spec/cdi/enabled", "value":true}]' - ``` - - *Example Output* - - ```output - clusterpolicy.nvidia.com/cluster-policy patched - ``` - -1. (Optional) Confirm that the container toolkit and device plugin pods restart: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - *Example Output* - -## Disabling CDI - -While CDI is the default and recommended mechanism for injecting GPU support into containers, you can -disable CDI and use the legacy NVIDIA Container Toolkit stack instead with the following procedure: - -1. If your nodes use the CRI-O container runtime, then temporarily disable the - GPU Operator validator: - - ```console - $ kubectl label nodes \ - nvidia.com/gpu.deploy.operator-validator=false \ - -l nvidia.com/gpu.present=true \ - --overwrite - ``` - - > [!TIP] - > You can run `kubectl get nodes -o wide` and view the `CONTAINER-RUNTIME` - > column to determine if your nodes use CRI-O. - - 1. 
Disable CDI by modifying the cluster policy: - - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ - -p='[{"op": "replace", "path": "/spec/cdi/enabled", "value":false}]' - ``` - - *Example Output* - - ```output - clusterpolicy.nvidia.com/cluster-policy patched - ``` - -1. If you temporarily disabled the GPU Operator validator, re-enable the validator: - - ```console - $ kubectl label nodes \ - nvidia.com/gpu.deploy.operator-validator=true \ - nvidia.com/gpu.present=true \ - --overwrite - ``` - -## About the Node Resource Interface (NRI) Plugin - -Node Resource Interface (NRI) is a standardized interface for plugging in extensions, called NRI Plugins, to OCI-compatible container runtimes like containerd. -NRI Plugins serve as hooks which intercept pod and container lifecycle events and perform functions including injecting devices to a container, topology aware placement strategies, and more. For more details on NRI, refer to the [NRI overview](https://github.com/containerd/nri/tree/main?tab=readme-ov-file#background) in the containerd repository. - -When enabled in the GPU Operator, the NVIDIA Container Toolkit daemonset will run an NRI Plugin on every GPU node. -The purpose of the NRI Plugin is to inject GPUs into GPU management containers that use the `NVIDIA_VISIBLE_DEVICES` environment variable to get GPU access, bypassing GPU allocation via the Device Plugin or DRA Driver for GPUs. - -In previous GPU Operator versions, device injection was handled by the `nvidia` container runtime. With CDI and the NRI Plugin enabled, the `nvidia` runtime class is no longer needed. When enabling the NRI plugin during install, the `nvidia` runtime class will not be created. If you enable the NRI Plugin after install, the `nvidia` runtime class will be deleted. - -Additionally, with the NRI Plugin enabled, modifications to the container runtime configuration are no longer needed. 
For example, no modifications are made to containerd’s config.toml file. -This means that on platforms that configure containerd in a non-standard way, like k3s, k0s, and Rancher Kubernetes Engine 2, users no longer need to configure environment variables like `CONTAINERD_CONFIG`, `CONTAINERD_SOCKET`, or `RUNTIME_CONFIG_SOURCE`. - -## Enabling the NRI Plugin - -The NRI Plugin requires the following: - -- CDI to be enabled in the GPU Operator. - -- containerd v1.7.30, v2.1.x, or v2.2.x. - If you are not using the latest containerd version, check that both CDI and NRI are enabled in the containerd configuration file before deploying GPU Operator. - - > [!NOTE] - > Enabling the NRI plugin is not supported with cri-o. - > To enable the NRI Plugin during installation, follow the instructions for installing the Operator with Helm on the getting-started page and include the `--set cdi.nriPluginEnabled=true` argument in your Helm command. - -### Enabling the NRI Plugin After Installation - -1. Enable NRI Plugin by modifying the cluster policy: - - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ - -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":true}]' - ``` - - *Example Output* - - ```output - clusterpolicy.nvidia.com/cluster-policy patched - ``` - - After enabling the NRI Plugin, the `nvidia` runtime class will be deleted. - -1. 
(Optional) Confirm that the container toolkit and device plugin pods restart: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - *Example Output* - -## Disabling the NRI Plugin - -Disable the NRI Plugin and use the `nvidia` runtime class instead with the following procedure: - -Disable the NRI Plugin by modifying the cluster policy: - -```console -$ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ - -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":false}]' -``` - -*Example Output* - -```output -clusterpolicy.nvidia.com/cluster-policy patched -``` - -After disabling the NRI Plugin, the `nvidia` runtime class will be created. +- CDI is the default and recommended injection mechanism since GPU Operator v25.10.0; disabling it reverts to the legacy NVIDIA Container Toolkit stack. +- The NRI Plugin requires CDI enabled and a supported containerd version; it is not supported with CRI-O. +- When CDI is enabled and the NRI Plugin is not, GPU Management Containers using `NVIDIA_VISIBLE_DEVICES` must set `runtimeClassName: nvidia`. +- Enabling the NRI Plugin deletes the `nvidia` runtime class; disabling it recreates the class. ## Verification -Confirm that CDI or the NRI Plugin is configured as expected: - -1. Confirm the GPU Operator pods, including the container toolkit and device plugin, are running: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - The `nvidia-container-toolkit-daemonset` and `nvidia-device-plugin-daemonset` pods should report `Running`. - -1. Run a GPU workload and confirm the GPU is injected into the container: - - ```console - $ kubectl run cuda-check --rm -it --restart=Never \ - --image=nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04 - ``` - - A successful run reports `Test PASSED`, confirming that the device was injected through CDI or the NRI Plugin. +Confirm the container-toolkit and device-plugin daemonsets are `Running` and a +CUDA sample reports `Test PASSED`. 
Exact commands are in +[references/verification.md](references/verification.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-container-device/references/cdi.md b/gpu-operator/.agents/skills/gpu-operator-container-device/references/cdi.md new file mode 100644 index 000000000..8fa041b39 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-container-device/references/cdi.md @@ -0,0 +1,111 @@ + + + +# Container Device Interface (CDI) + +## About Container Device Interface (CDI) + +The [Container Device Interface (CDI)](https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md) +is an open specification for container runtimes that abstracts what access to a device, such as an NVIDIA GPU, means, +and standardizes access across container runtimes. Popular container runtimes can read and process the specification to +ensure that a device is available in a container. CDI simplifies adding support for devices such as NVIDIA GPUs because +the specification is applicable to all container runtimes that support CDI. + +Starting with GPU Operator v25.10.0, CDI is used by default for enabling GPU support in containers running on Kubernetes. +Specifically, CDI support in container runtimes, like containerd and cri-o, is used to inject GPU(s) into workload +containers. This differs from prior GPU Operator releases where CDI was used via a CDI-enabled `nvidia` runtime class. + +If you are upgrading from a version of the GPU Operator prior to v25.10.0, where CDI was disabled by default, and you are upgrading to v25.10.0 or later, where CDI is enabled by default, no configuration changes are required for standard workloads using GPU allocation through the Device Plugin. +For workloads that already have `runtimeClassName: nvidia` set in their pod spec YAML, no change is necessary. + +Use of CDI is transparent to cluster administrators and application developers. 
+The benefits of CDI are largely to reduce development and support for runtime-specific +plugins. + +### CDI and GPU Management Containers + +When CDI is enabled in GPU Operator versions v25.10.0 and later, GPU Management Containers that use the `NVIDIA_VISIBLE_DEVICES` environment variable to get GPU access, bypassing GPU allocation via the Device Plugin or DRA Driver for GPUs, must set `runtimeClassName: nvidia` in the pod specification. +A GPU Management Container is a container that requires access to all GPUs without them being allocated by Kubernetes. +Examples of GPU Management Containers include monitoring agents and device plugins. + +It is recommended that `NVIDIA_VISIBLE_DEVICES` only be used by GPU Management Containers. + +> [!NOTE] +> Setting `runtimeClassName: nvidia` in the pod specification is not required when the NRI Plugin is enabled in GPU Operator. +> Refer to About the Node Resource Interface (NRI) Plugin. + +## Enabling CDI + +CDI is enabled by default during installation in GPU Operator v25.10.0 and later. +Follow the instructions for installing the Operator with Helm on the getting-started page. + +CDI is also enabled by default during a Helm upgrade to GPU Operator v25.10.0 and later. + +### Enabling CDI After Installation + +CDI is enabled by default in GPU Operator v25.10.0 and later. +Use the following procedure to enable CDI if you disabled CDI during installation. + +#### Procedure + +1. Enable CDI by modifying the cluster policy: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ + -p='[{"op": "replace", "path": "/spec/cdi/enabled", "value":true}]' + ``` + + *Example Output* + + ```output + clusterpolicy.nvidia.com/cluster-policy patched + ``` + +1. 
(Optional) Confirm that the container toolkit and device plugin pods restart: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + *Example Output* + +## Disabling CDI + +While CDI is the default and recommended mechanism for injecting GPU support into containers, you can +disable CDI and use the legacy NVIDIA Container Toolkit stack instead with the following procedure: + +1. If your nodes use the CRI-O container runtime, then temporarily disable the + GPU Operator validator: + + ```console + $ kubectl label nodes \ + nvidia.com/gpu.deploy.operator-validator=false \ + -l nvidia.com/gpu.present=true \ + --overwrite + ``` + + > [!TIP] + > You can run `kubectl get nodes -o wide` and view the `CONTAINER-RUNTIME` + > column to determine if your nodes use CRI-O. + + 1. Disable CDI by modifying the cluster policy: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ + -p='[{"op": "replace", "path": "/spec/cdi/enabled", "value":false}]' + ``` + + *Example Output* + + ```output + clusterpolicy.nvidia.com/cluster-policy patched + ``` + +1. If you temporarily disabled the GPU Operator validator, re-enable the validator: + + ```console + $ kubectl label nodes \ + nvidia.com/gpu.deploy.operator-validator=true \ + -l nvidia.com/gpu.present=true \ + --overwrite + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-container-device/references/nri.md b/gpu-operator/.agents/skills/gpu-operator-container-device/references/nri.md new file mode 100644 index 000000000..122921fc0 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-container-device/references/nri.md @@ -0,0 +1,74 @@ + + + +# Node Resource Interface (NRI) Plugin + +## About the Node Resource Interface (NRI) Plugin + +Node Resource Interface (NRI) is a standardized interface for plugging in extensions, called NRI Plugins, to OCI-compatible container runtimes like containerd.
+NRI Plugins serve as hooks which intercept pod and container lifecycle events and perform functions including injecting devices to a container, topology aware placement strategies, and more. For more details on NRI, refer to the [NRI overview](https://github.com/containerd/nri/tree/main?tab=readme-ov-file#background) in the containerd repository. + +When enabled in the GPU Operator, the NVIDIA Container Toolkit daemonset will run an NRI Plugin on every GPU node. +The purpose of the NRI Plugin is to inject GPUs into GPU management containers that use the `NVIDIA_VISIBLE_DEVICES` environment variable to get GPU access, bypassing GPU allocation via the Device Plugin or DRA Driver for GPUs. + +In previous GPU Operator versions, device injection was handled by the `nvidia` container runtime. With CDI and the NRI Plugin enabled, the `nvidia` runtime class is no longer needed. When enabling the NRI plugin during install, the `nvidia` runtime class will not be created. If you enable the NRI Plugin after install, the `nvidia` runtime class will be deleted. + +Additionally, with the NRI Plugin enabled, modifications to the container runtime configuration are no longer needed. For example, no modifications are made to containerd’s config.toml file. +This means that on platforms that configure containerd in a non-standard way, like k3s, k0s, and Rancher Kubernetes Engine 2, users no longer need to configure environment variables like `CONTAINERD_CONFIG`, `CONTAINERD_SOCKET`, or `RUNTIME_CONFIG_SOURCE`. + +## Enabling the NRI Plugin + +The NRI Plugin requires the following: + +- CDI to be enabled in the GPU Operator. + +- containerd v1.7.30, v2.1.x, or v2.2.x. + If you are not using the latest containerd version, check that both CDI and NRI are enabled in the containerd configuration file before deploying GPU Operator. + + > [!NOTE] + > Enabling the NRI plugin is not supported with cri-o. 
+ > To enable the NRI Plugin during installation, follow the instructions for installing the Operator with Helm on the getting-started page and include the `--set cdi.nriPluginEnabled=true` argument in your Helm command. + +### Enabling the NRI Plugin After Installation + +1. Enable the NRI Plugin by modifying the cluster policy: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ + -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":true}]' + ``` + + *Example Output* + + ```output + clusterpolicy.nvidia.com/cluster-policy patched + ``` + + After enabling the NRI Plugin, the `nvidia` runtime class will be deleted. + +1. (Optional) Confirm that the container toolkit and device plugin pods restart: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + *Example Output* + +## Disabling the NRI Plugin + +Disable the NRI Plugin and use the `nvidia` runtime class instead with the following procedure: + +Disable the NRI Plugin by modifying the cluster policy: + +```console +$ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ + -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":false}]' +``` + +*Example Output* + +```output +clusterpolicy.nvidia.com/cluster-policy patched +``` + +After disabling the NRI Plugin, the `nvidia` runtime class will be created. diff --git a/gpu-operator/.agents/skills/gpu-operator-container-device/references/verification.md b/gpu-operator/.agents/skills/gpu-operator-container-device/references/verification.md new file mode 100644 index 000000000..c60d4fc89 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-container-device/references/verification.md @@ -0,0 +1,23 @@ + + + +# Verification + +Confirm that CDI or the NRI Plugin is configured as expected: + +1.
Confirm the GPU Operator pods, including the container toolkit and device plugin, are running: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + The `nvidia-container-toolkit-daemonset` and `nvidia-device-plugin-daemonset` pods should report `Running`. + +1. Run a GPU workload and confirm the GPU is injected into the container: + + ```console + $ kubectl run cuda-check --rm -it --restart=Never \ + --image=nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04 + ``` + + A successful run reports `Test PASSED`, confirming that the device was injected through CDI or the NRI Plugin. diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md index 836fc55ce..39f2a67a2 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/SKILL.md @@ -21,211 +21,41 @@ tags: # NVIDIA GPU Operator with Amazon EKS +Deploy the NVIDIA GPU Operator on Amazon EKS. The recommended approach is to +create a GPU node group on an Operator-supported AMI (Ubuntu) so the Operator +manages the full driver/toolkit/device-plugin lifecycle, rather than relying on +the default Amazon Linux AMI's preinstalled (lagging) driver. + ## Prerequisites - An AWS account, plus the AWS CLI and `eksctl` installed and configured (see the per-example prerequisites below for details). - The `kubectl` and `helm` CLIs available on a client machine. - An Amazon EKS cluster, or the ability to create one, with a GPU-enabled node group that uses an AMI with an operating system that the GPU Operator supports. -## Approaches for Working with Amazon EKS - -You can approach running workloads in Amazon EKS with NVIDIA GPUs in at least two ways. - -### Default EKS configuration without the GPU Operator - -By default, you can run Amazon EKS optimized Amazon Linux AMIs on instance types -that support NVIDIA GPUs. 
- -Using the default configuration has the following limitations: - -* The pre-installed NVIDIA GPU driver version and NVIDIA container runtime version - lags the release schedule from NVIDIA. -* You must deploy the NVIDIA device plugin and you assume responsibility for - upgrading the plugin. - -If these limitations are acceptable to you, refer to -[Amazon EKS optimized Amazon Linux AMIs](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html) -in the Amazon EKS documentation for information about configuring your cluster. -You do not need to install the NVIDIA GPU Operator. - -### EKS Node Group with the GPU Operator - -To overcome the limitations with the first approach, you can create a node group for your cluster. -Configure the node group with instance types that have -NVIDIA GPUs and use an AMI with an operating system that the GPU Operator supports. -The Operator does not support a mix of some nodes running Amazon Linux 2 and others -running a supported operating system in the same cluster. - -In this case, the Operator manages the lifecycle of all the operands, including -the NVIDIA GPU driver containers. -This approach enables you to run the most recent NVIDIA GPU drivers and use the -Operator to manage upgrades of the driver and other software components such as -the NVIDIA device plugin, NVIDIA Container Toolkit, and NVIDIA MIG Manager. - -This approach provides the most up-to-date software and the Operator reduces -the administrative overhead. - -### EKS Node Groups in Brief and Client Applications - -When you configure an Amazon EKS node group, you can configure -[self-managed nodes](https://docs.aws.amazon.com/eks/latest/userguide/worker.html) -or [managed nodes groups](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html). - -Amazon EKS supports many clients for creating a node group. - -For self-managed nodes, you can use the `eksctl` CLI or Amazon Management Console. 
-Refer to the preceding URL for concepts and procedures. - -For managed node groups, you can use the Amazon Management Console. -The Amazon EKS documentation describes how to use the `eksctl` CLI, -but the CLI does not support operating systems other than Amazon Linux 2 and -the Operator does not support that operating system. -Refer to the preceding URL for concepts and procedures. - -Terraform supports creating self-managed and managed node groups. -Refer to -[AWS EKS Terraform module](https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest) -in the Terraform Registry for more information. - -## About Using the Operator with Amazon EKS - -To use the NVIDIA GPU Operator with Amazon Elastic Kubernetes Service (EKS) -without any limitations, you perform the following high-level actions: - -* Create a self-managed or managed node group with instance types that have NVIDIA GPUs. - - Refer to the following resources in the Amazon EC2 documentation to help you choose - the instance type to meet your needs: - - * Table of accelerated computing - [instance types](https://aws.amazon.com/ec2/instance-types/accelerated-computing/) - for information about GPU model and count, RAM, and storage. - - * [Maximum IP addresses per network interface](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AvailableIpPerENI.html) - for accelerated computing instance types. - Make sure the instance type supports enough IP addresses for your workload. - For example, the `g4dn.xlarge` instance type supports `29` IP addresses for pods on the node. - -* Use an Amazon EKS optimized Amazon Machine Image (AMI) with a supported operating system (use the `gpu-operator-references` skill) on the nodes in the node group. +## Activation - AMIs support are specific to an AWS region and Kubernetes version. - See https://cloud-images.ubuntu.com/aws-eks/ for the AMI values such as `ami-00687acd80b7a620a`. 
+Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All `eksctl`/`kubectl` command sequences, the `ClusterConfig` YAML, +and verification output live only in those reference files — do not improvise +commands from this dispatch layer. -* Use your preferred client application to create the node group. +## Phases -## Example: Create a Self-Managed Node Group with eksctl +| Phase | Summary | Reference | +|-------|---------|-----------| +| Approaches | The two EKS options (default Amazon Linux without the Operator vs a GPU node group with the Operator), node-group/client-application choices (self-managed vs managed; eksctl/Console/Terraform), and the high-level steps + instance-type/AMI selection guidance. | [references/approaches.md](references/approaches.md) | +| eksctl example | A worked self-managed-node-group example: the `cluster-config.yaml` (`ClusterConfig`), `eksctl create cluster`, and post-install verification that GPU nodes advertise capacity and the validator completes. | [references/eksctl-example.md](references/eksctl-example.md) | -### Prerequisites +## Hard rules (apply across all phases) -* You have access to the Amazon Management Console or you installed and configured the AWS CLI. - Refer to - [Installing or updating to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) - and [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) - in the AWS CLI documentation. -* You installed the `eksctl` CLI if you prefer it as your client application. - The CLI is available from https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html#eksctl-install-update. -* You have the AMI value from https://cloud-images.ubuntu.com/aws-eks/. -* You have the EC2 instance type to use for your nodes. 
- -### Procedure - -The following steps show how to create an Amazon EKS cluster with the `eksctl` CLI. -The steps create a self-managed node group that uses an Amazon EKS optimized AMI. - -1. Create a file, such as `cluster-config.yaml`, with contents like the following example: - - ```yaml - apiVersion: eksctl.io/v1alpha5 - kind: ClusterConfig - metadata: - name: demo-cluster - region: us-west-2 - version: "1.25" - nodeGroups: - - name: demo-gpu-workers - instanceType: g4dn.xlarge - ami: ami-0770ab88ec35aa875 - amiFamily: Ubuntu2004 - minSize: 1 - desiredCapacity: 3 - maxSize: 3 - volumeSize: 100 - overrideBootstrapCommand: | - #!/bin/bash - source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh - /etc/eks/bootstrap.sh ${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=${NODE_LABELS}" - ssh: - allow: true - publicKeyPath: ~/.ssh/id_rsa.pub - ``` - - Replace the values for the cluster name, Kubernetes version, and so on. - To resolve the environment variables in the override bootstrap command, you must source the bootstrap helper script. - - > [!TIP] - > The default volume size for each node is 20 GB. - > In many cases, containers with frameworks for AI/ML workloads are often very large. - > The sample YAML file specifies a 100 GB volume to ensure enough local disk space for containers. - - 1. Create the Amazon EKS cluster with the node group: - - ```console - $ eksctl create cluster -f cluster-config.yaml - ``` - - Creating the cluster requires several minutes. - - *Example Output* - - ```output - 2022-08-19 17:51:04 [i] eksctl version 0.105.0 - 2022-08-19 17:51:04 [i] using region us-west-2 - 2022-08-19 17:51:04 [i] setting availability zones to [us-west-2d us-west-2c us-west-2a] - 2022-08-19 17:51:04 [i] subnets for us-west-2d - public:192.168.0.0/19 private:192.168.96.0/19 - ... - [✓] EKS cluster "demo-cluster" in "us-west-2" region is ready - ``` - -1. 
Optional: View the cluster name: - - ```console - $ eksctl get cluster - ``` - - *Example Output* - - ```output - NAME REGION EKSCTL CREATED - demo-cluster us-west-2 True - ``` +- The GPU Operator does not support Amazon Linux 2; use a supported AMI (Ubuntu 20.04/22.04) and do not mix Amazon Linux 2 nodes with supported-OS nodes in the same cluster. +- Choose an instance type with enough pod IP addresses for your workload (e.g., `g4dn.xlarge` supports 29). +- AMI values are region- and Kubernetes-version-specific; look them up rather than hardcoding. +- After the node group exists, install the Operator via the `gpu-operator-install` skill. ## Verification -After the node group is created and the GPU Operator is installed on the cluster (use the `gpu-operator-install` skill), confirm that the GPU nodes are managed: - -1. Confirm the GPU nodes advertise GPU capacity: - - ```console - $ kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"' - ``` - - Each GPU node should report a non-null GPU count. - -1. Confirm the GPU Operator pods are running: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - The `nvidia-operator-validator` pod should report `Completed`. - -## Related Information - -* The preceding procedure is derived from - [Getting started with Amazon EKS - eksctl](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html) - in the Amazon EKS documentation. -* If you have an existing Amazon EKS cluster, you can refer to - [Launching self-managed Amazon Linux nodes](https://docs.aws.amazon.com/eks/latest/userguide/launch-workers.html) - in the Amazon EKS documentation to add a self-managed node group to your cluster. - However, all nodes in the cluster must run Ubuntu 20.04 or 22.04. - This documentation includes steps for using the AWS Management Console. +Confirm GPU nodes advertise `nvidia.com/gpu` capacity and the +`nvidia-operator-validator` pod is `Completed`. 
Exact commands are in +[references/eksctl-example.md](references/eksctl-example.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/references/approaches.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/references/approaches.md new file mode 100644 index 000000000..49f730075 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/references/approaches.md @@ -0,0 +1,88 @@ + + + +# Approaches for Working with Amazon EKS + +You can approach running workloads in Amazon EKS with NVIDIA GPUs in at least two ways. + +## Default EKS configuration without the GPU Operator + +By default, you can run Amazon EKS optimized Amazon Linux AMIs on instance types +that support NVIDIA GPUs. + +Using the default configuration has the following limitations: + +* The pre-installed NVIDIA GPU driver version and NVIDIA container runtime version + lags the release schedule from NVIDIA. +* You must deploy the NVIDIA device plugin and you assume responsibility for + upgrading the plugin. + +If these limitations are acceptable to you, refer to +[Amazon EKS optimized Amazon Linux AMIs](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html) +in the Amazon EKS documentation for information about configuring your cluster. +You do not need to install the NVIDIA GPU Operator. + +## EKS Node Group with the GPU Operator + +To overcome the limitations with the first approach, you can create a node group for your cluster. +Configure the node group with instance types that have +NVIDIA GPUs and use an AMI with an operating system that the GPU Operator supports. +The Operator does not support a mix of some nodes running Amazon Linux 2 and others +running a supported operating system in the same cluster. + +In this case, the Operator manages the lifecycle of all the operands, including +the NVIDIA GPU driver containers. 
+This approach enables you to run the most recent NVIDIA GPU drivers and use the +Operator to manage upgrades of the driver and other software components such as +the NVIDIA device plugin, NVIDIA Container Toolkit, and NVIDIA MIG Manager. + +This approach provides the most up-to-date software and the Operator reduces +the administrative overhead. + +## EKS Node Groups in Brief and Client Applications + +When you configure an Amazon EKS node group, you can configure +[self-managed nodes](https://docs.aws.amazon.com/eks/latest/userguide/worker.html) +or [managed node groups](https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html). + +Amazon EKS supports many clients for creating a node group. + +For self-managed nodes, you can use the `eksctl` CLI or Amazon Management Console. +Refer to the preceding URL for concepts and procedures. + +For managed node groups, you can use the Amazon Management Console. +The Amazon EKS documentation describes how to use the `eksctl` CLI, +but the CLI does not support operating systems other than Amazon Linux 2 and +the Operator does not support that operating system. +Refer to the preceding URL for concepts and procedures. + +Terraform supports creating self-managed and managed node groups. +Refer to +[AWS EKS Terraform module](https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest) +in the Terraform Registry for more information. + +## About Using the Operator with Amazon EKS + +To use the NVIDIA GPU Operator with Amazon Elastic Kubernetes Service (EKS) +without any limitations, you perform the following high-level actions: + +* Create a self-managed or managed node group with instance types that have NVIDIA GPUs.
+ + Refer to the following resources in the Amazon EC2 documentation to help you choose + the instance type to meet your needs: + + * Table of accelerated computing + [instance types](https://aws.amazon.com/ec2/instance-types/accelerated-computing/) + for information about GPU model and count, RAM, and storage. + + * [Maximum IP addresses per network interface](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AvailableIpPerENI.html) + for accelerated computing instance types. + Make sure the instance type supports enough IP addresses for your workload. + For example, the `g4dn.xlarge` instance type supports `29` IP addresses for pods on the node. + +* Use an Amazon EKS optimized Amazon Machine Image (AMI) with a supported operating system (use the `gpu-operator-references` skill) on the nodes in the node group. + + AMI support is specific to an AWS region and Kubernetes version. + See https://cloud-images.ubuntu.com/aws-eks/ for the AMI values such as `ami-00687acd80b7a620a`. + +* Use your preferred client application to create the node group. diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/references/eksctl-example.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/references/eksctl-example.md new file mode 100644 index 000000000..bd3c5354b --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-amazon/references/eksctl-example.md @@ -0,0 +1,119 @@ + + + +# Example: Create a Self-Managed Node Group with eksctl + +## Prerequisites + +* You have access to the Amazon Management Console or you installed and configured the AWS CLI. + Refer to + [Installing or updating to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) + and [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) + in the AWS CLI documentation. +* You installed the `eksctl` CLI if you prefer it as your client application.
+ The CLI is available from https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html#eksctl-install-update. +* You have the AMI value from https://cloud-images.ubuntu.com/aws-eks/. +* You have the EC2 instance type to use for your nodes. + +## Procedure + +The following steps show how to create an Amazon EKS cluster with the `eksctl` CLI. +The steps create a self-managed node group that uses an Amazon EKS optimized AMI. + +1. Create a file, such as `cluster-config.yaml`, with contents like the following example: + + ```yaml + apiVersion: eksctl.io/v1alpha5 + kind: ClusterConfig + metadata: + name: demo-cluster + region: us-west-2 + version: "1.25" + nodeGroups: + - name: demo-gpu-workers + instanceType: g4dn.xlarge + ami: ami-0770ab88ec35aa875 + amiFamily: Ubuntu2004 + minSize: 1 + desiredCapacity: 3 + maxSize: 3 + volumeSize: 100 + overrideBootstrapCommand: | + #!/bin/bash + source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh + /etc/eks/bootstrap.sh ${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=${NODE_LABELS}" + ssh: + allow: true + publicKeyPath: ~/.ssh/id_rsa.pub + ``` + + Replace the values for the cluster name, Kubernetes version, and so on. + To resolve the environment variables in the override bootstrap command, you must source the bootstrap helper script. + + > [!TIP] + > The default volume size for each node is 20 GB. + > In many cases, containers with frameworks for AI/ML workloads are often very large. + > The sample YAML file specifies a 100 GB volume to ensure enough local disk space for containers. + + 1. Create the Amazon EKS cluster with the node group: + + ```console + $ eksctl create cluster -f cluster-config.yaml + ``` + + Creating the cluster requires several minutes. 
+ + *Example Output* + + ```output + 2022-08-19 17:51:04 [i] eksctl version 0.105.0 + 2022-08-19 17:51:04 [i] using region us-west-2 + 2022-08-19 17:51:04 [i] setting availability zones to [us-west-2d us-west-2c us-west-2a] + 2022-08-19 17:51:04 [i] subnets for us-west-2d - public:192.168.0.0/19 private:192.168.96.0/19 + ... + [✓] EKS cluster "demo-cluster" in "us-west-2" region is ready + ``` + +1. Optional: View the cluster name: + + ```console + $ eksctl get cluster + ``` + + *Example Output* + + ```output + NAME REGION EKSCTL CREATED + demo-cluster us-west-2 True + ``` + +## Verification + +After the node group is created and the GPU Operator is installed on the cluster (use the `gpu-operator-install` skill), confirm that the GPU nodes are managed: + +1. Confirm the GPU nodes advertise GPU capacity: + + ```console + $ kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"' + ``` + + Each GPU node should report a non-null GPU count. + +1. Confirm the GPU Operator pods are running: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + The `nvidia-operator-validator` pod should report `Completed`. + +## Related Information + +* The preceding procedure is derived from + [Getting started with Amazon EKS - eksctl](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html) + in the Amazon EKS documentation. +* If you have an existing Amazon EKS cluster, you can refer to + [Launching self-managed Amazon Linux nodes](https://docs.aws.amazon.com/eks/latest/userguide/launch-workers.html) + in the Amazon EKS documentation to add a self-managed node group to your cluster. + However, all nodes in the cluster must run Ubuntu 20.04 or 22.04. + This documentation includes steps for using the AWS Management Console. 
diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md index c081483ec..800a3c690 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/SKILL.md @@ -20,6 +20,12 @@ tags: # NVIDIA GPU Operator with Google GKE +Deploy the NVIDIA GPU Operator on Google GKE Standard node pools. Two driver +strategies are supported: the **Google driver installer** (the Google installer +manages the driver; the Operator manages everything else) and the **NVIDIA +Driver Manager** (the Operator manages the driver and the full software +lifecycle). The choice depends on node OS. GKE Autopilot is not supported. + ## Prerequisites - You installed and initialized the Google Cloud CLI. Refer to [gcloud CLI overview](https://cloud.google.com/sdk/gcloud) in the Google Cloud documentation. @@ -27,254 +33,31 @@ tags: - You have the project ID for your Google Cloud project. Refer to [Identifying projects](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects) in the Google Cloud documentation. - You know the machine type for the node pool and that the machine type is supported in your region and zone. Refer to [GPU platforms](https://cloud.google.com/compute/docs/gpus) in the Google Cloud documentation. -## About Using the Operator with Google GKE - -There are two ways to use NVIDIA GPU Operator with Google Kubernetes Engine (GKE). -You can use Google driver installer to install and manage NVIDIA GPU Driver on the nodes -or you can use the Operator and driver manager to manage the driver and other NVIDIA software components. - -The choice depends on the operating system and whether you prefer to have the Operator manage all the software components. 
- -| Approach | Supported OS | Summary | -| --- | --- | --- | -| Google Driver Installer | Container-Optimized OS, Ubuntu with containerd | The Google driver installer manages the NVIDIA GPU Driver. NVIDIA GPU Operator manages other software components. | -| NVIDIA Driver Manager | Ubuntu with containerd | NVIDIA GPU Operator manages the lifecycle and upgrades of the driver and other NVIDIA software. | - -The preceding information relates to using GKE Standard node pools. -For Autopilot Pods, using the GPU Operator is not supported, and you can refer to -[Deploy GPU workloads in Autopilot](https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-gpus). - -## Using the Google Driver Installer - -Perform the following steps to create a GKE cluster with the `gcloud` CLI and use Google driver installer to manage the GPU driver. -You can create a node pool that uses a Container-Optimized OS node image or a Ubuntu node image. - -1. Create the node pool. - Refer to [Running GPUs in GKE Standard clusters](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#create) - in the GKE documentation. - - When you create the node pool, specify the following additional `gcloud` command-line options to disable GKE features that are not supported with the Operator: - - - `--node-labels="gke-no-default-nvidia-gpu-device-plugin=true"` - - The node label disables the GKE GPU device plugin daemon set on GPU nodes. - - - `--accelerator type=...,gpu-driver-version=disabled` - - This argument disables automatically installing the GPU driver on GPU nodes. - -1. Get the authentication credentials for the cluster: - - ```console - $ gcloud container clusters get-credentials demo-cluster --location us-west1 - ``` - -1. Optional: Verify that you can connect to the cluster: - - ```console - $ kubectl get nodes -o wide - ``` - -1. Create the namespace for the NVIDIA GPU Operator: - - ```console - $ kubectl create ns gpu-operator - ``` - -1. 
Create a file, such as `gpu-operator-quota.yaml`, with contents like the following example: - - ```yaml - apiVersion: v1 - kind: ResourceQuota - metadata: - name: gpu-operator-quota - spec: - hard: - pods: 100 - scopeSelector: - matchExpressions: - - operator: In - scopeName: PriorityClass - values: - - system-node-critical - - system-cluster-critical - ``` - -1. Apply the resource quota: - - ```console - $ kubectl apply -n gpu-operator -f gpu-operator-quota.yaml - ``` - -1. Optional: View the resource quota: - - ```console - $ kubectl get -n gpu-operator resourcequota - ``` - - *Example Output* - - ```output - NAME AGE REQUEST - gpu-operator-quota 38s pods: 0/100 - ``` - -1. Install the Google driver installer daemon set. - - For Container-Optimized OS: - - ```console - $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml - ``` +## Activation - For Ubuntu, the manifest to apply depends on GPU model and node version. - Refer to the **Ubuntu** tab at - [Manually install NVIDIA GPU drivers](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) - in the GKE documentation. +Do this first: pick the driver-management approach matching your node OS from +the Phases table below, then **read the corresponding `references/.md` +file before acting**. All `gcloud`/`kubectl`/`helm` command sequences, the +`ResourceQuota` YAML, Helm `--set` values, and verification output live only in +those reference files — do not improvise commands from this dispatch layer. -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). +## Phases -1. 
Install the Operator using Helm: +| Phase | Summary | Reference | +|-------|---------|-----------| +| Approaches | The two GKE strategies (Google Driver Installer vs NVIDIA Driver Manager), their supported OSes, the Autopilot non-support note, and the standard node-pool disabling flags. | [references/approaches.md](references/approaches.md) | +| Google Driver Installer | Create the node pool with GKE-device-plugin/driver disabled, set credentials, create the `gpu-operator` namespace + a `ResourceQuota`, apply the Google driver-installer daemonset (COS/Ubuntu), and Helm-install the Operator with `driver.enabled=false` and the GKE install paths. | [references/google-driver-installer.md](references/google-driver-installer.md) | +| NVIDIA Driver Manager | Create an `UBUNTU_CONTAINERD` cluster with the GKE device plugin disabled, set credentials, create the namespace + `ResourceQuota`, then install the Operator (via the `gpu-operator-install` skill) so it manages the driver. | [references/nvidia-driver-manager.md](references/nvidia-driver-manager.md) | - ```console - $ helm install --wait --generate-name \ - -n gpu-operator \ - nvidia/gpu-operator \ - --version= \ - --set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \ - --set toolkit.installDir=/home/kubernetes/bin/nvidia \ - --set cdi.enabled=true \ - --set cdi.default=true \ - --set driver.enabled=false - ``` +## Hard rules (apply across all phases) - Set the NVIDIA Container Toolkit and driver installation path to `/home/kubernetes/bin/nvidia`. - On GKE node images, this directory is writable and is a stateful location for storing the NVIDIA runtime binaries. 
- - To configure MIG with NVIDIA MIG Manager, specify the following additional Helm command arguments: - - ```console - --set migManager.env[0].name=WITH_REBOOT \ - --set-string migManager.env[0].value=true - ``` - -## Using NVIDIA Driver Manager - -Perform the following steps to create a GKE cluster with the `gcloud` CLI and use the Operator and NVIDIA Driver Manager to manage the GPU driver. -The steps create the cluster with a node pool that uses a Ubuntu and containerd node image. - -1. Create the cluster by running a command that is similar to the following example: - - ```console - $ gcloud beta container clusters create demo-cluster \ - --project \ - --location us-west1 \ - --release-channel "regular" \ - --machine-type "n1-standard-4" \ - --accelerator "type=nvidia-tesla-t4,count=1" \ - --image-type "UBUNTU_CONTAINERD" \ - --node-labels="gke-no-default-nvidia-gpu-device-plugin=true" \ - --disk-type "pd-standard" \ - --disk-size "1000" \ - --no-enable-intra-node-visibility \ - --metadata disable-legacy-endpoints=true \ - --max-pods-per-node "110" \ - --num-nodes "1" \ - --logging=SYSTEM,WORKLOAD \ - --monitoring=SYSTEM \ - --enable-ip-alias \ - --default-max-pods-per-node "110" \ - --no-enable-master-authorized-networks \ - --tags=nvidia-ingress-all - ``` - - Creating the cluster requires several minutes. - -1. Get the authentication credentials for the cluster: - - ```console - $ USE_GKE_GCLOUD_AUTH_PLUGIN=True \ - gcloud container clusters get-credentials demo-cluster --zone us-west1 - ``` - -1. Optional: Verify that you can connect to the cluster: - - ```console - $ kubectl get nodes -o wide - ``` - -1. Create the namespace for the NVIDIA GPU Operator: - - ```console - $ kubectl create ns gpu-operator - ``` - -1. 
Create a file, such as `gpu-operator-quota.yaml`, with contents like the following example: - - ```yaml - apiVersion: v1 - kind: ResourceQuota - metadata: - name: gpu-operator-quota - spec: - hard: - pods: 100 - scopeSelector: - matchExpressions: - - operator: In - scopeName: PriorityClass - values: - - system-node-critical - - system-cluster-critical - ``` - -1. Apply the resource quota: - - ```console - $ kubectl apply -n gpu-operator -f gpu-operator-quota.yaml - ``` - -1. Optional: View the resource quota: - - ```console - $ kubectl get -n gpu-operator resourcequota - ``` - - *Example Output* - - ```output - NAME AGE REQUEST - gke-resource-quotas 6m56s count/ingresses.extensions: 0/100, count/ingresses.networking.k8s.io: 0/100, count/jobs.batch: 0/5k, pods: 2/1500, services: 1/500 - gpu-operator-quota 38s pods: 0/100 - ``` - -1. Install the Operator (use the `gpu-operator-install` skill). +- GKE Autopilot does not support the GPU Operator; use Standard node pools. +- On all GPU node pools, disable the GKE GPU device plugin (`--node-labels="gke-no-default-nvidia-gpu-device-plugin=true"`) and, for the Google-installer/NVIDIA-manager paths, the automatic driver install (`--accelerator ...,gpu-driver-version=disabled`). +- For the Google Driver Installer path, set the toolkit/driver install dir to the writable `/home/kubernetes/bin/nvidia` and install with `driver.enabled=false`. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. ## Verification -After installing the Operator, confirm that the GPU nodes are managed and operands are healthy: - -1. Confirm the GPU nodes advertise GPU capacity: - - ```console - $ kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"' - ``` - -1. Confirm the GPU Operator pods are running: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - The `nvidia-operator-validator` pod should report `Completed`. 
- -## Related Information - -* If you have an existing GKE cluster, refer to - [Add and manage node pools](https://cloud.google.com/kubernetes-engine/docs/how-to/node-pools) - in the GKE documentation. -* When you create new node pools, specify the - `--node-labels="gke-no-default-nvidia-gpu-device-plugin=true"` and - `--accelerator type=...,gpu-driver-version=disabled` CLI arguments - to disable the GKE GPU device plugin daemon set and automatic driver installation on GPU nodes. +Confirm GPU nodes advertise `nvidia.com/gpu` capacity and the +`nvidia-operator-validator` pod is `Completed`. Exact commands are in +[references/nvidia-driver-manager.md](references/nvidia-driver-manager.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-google/references/approaches.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/references/approaches.md new file mode 100644 index 000000000..7a1db57e5 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/references/approaches.md @@ -0,0 +1,29 @@ + + + +# About Using the Operator with Google GKE + +There are two ways to use NVIDIA GPU Operator with Google Kubernetes Engine (GKE). +You can use Google driver installer to install and manage NVIDIA GPU Driver on the nodes +or you can use the Operator and driver manager to manage the driver and other NVIDIA software components. + +The choice depends on the operating system and whether you prefer to have the Operator manage all the software components. + +| Approach | Supported OS | Summary | +| --- | --- | --- | +| Google Driver Installer | Container-Optimized OS, Ubuntu with containerd | The Google driver installer manages the NVIDIA GPU Driver. NVIDIA GPU Operator manages other software components. | +| NVIDIA Driver Manager | Ubuntu with containerd | NVIDIA GPU Operator manages the lifecycle and upgrades of the driver and other NVIDIA software. | + +The preceding information relates to using GKE Standard node pools. 
+For Autopilot Pods, using the GPU Operator is not supported, and you can refer to +[Deploy GPU workloads in Autopilot](https://cloud.google.com/kubernetes-engine/docs/how-to/autopilot-gpus). + +## Related Information + +* If you have an existing GKE cluster, refer to + [Add and manage node pools](https://cloud.google.com/kubernetes-engine/docs/how-to/node-pools) + in the GKE documentation. +* When you create new node pools, specify the + `--node-labels="gke-no-default-nvidia-gpu-device-plugin=true"` and + `--accelerator type=...,gpu-driver-version=disabled` CLI arguments + to disable the GKE GPU device plugin daemon set and automatic driver installation on GPU nodes. diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-google/references/google-driver-installer.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/references/google-driver-installer.md new file mode 100644 index 000000000..8f286e8d3 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/references/google-driver-installer.md @@ -0,0 +1,117 @@ + + + +# Using the Google Driver Installer + +Perform the following steps to create a GKE cluster with the `gcloud` CLI and use Google driver installer to manage the GPU driver. +You can create a node pool that uses a Container-Optimized OS node image or a Ubuntu node image. + +1. Create the node pool. + Refer to [Running GPUs in GKE Standard clusters](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#create) + in the GKE documentation. + + When you create the node pool, specify the following additional `gcloud` command-line options to disable GKE features that are not supported with the Operator: + + - `--node-labels="gke-no-default-nvidia-gpu-device-plugin=true"` + + The node label disables the GKE GPU device plugin daemon set on GPU nodes. + + - `--accelerator type=...,gpu-driver-version=disabled` + + This argument disables automatically installing the GPU driver on GPU nodes. + +1. 
Get the authentication credentials for the cluster: + + ```console + $ gcloud container clusters get-credentials demo-cluster --location us-west1 + ``` + +1. Optional: Verify that you can connect to the cluster: + + ```console + $ kubectl get nodes -o wide + ``` + +1. Create the namespace for the NVIDIA GPU Operator: + + ```console + $ kubectl create ns gpu-operator + ``` + +1. Create a file, such as `gpu-operator-quota.yaml`, with contents like the following example: + + ```yaml + apiVersion: v1 + kind: ResourceQuota + metadata: + name: gpu-operator-quota + spec: + hard: + pods: 100 + scopeSelector: + matchExpressions: + - operator: In + scopeName: PriorityClass + values: + - system-node-critical + - system-cluster-critical + ``` + +1. Apply the resource quota: + + ```console + $ kubectl apply -n gpu-operator -f gpu-operator-quota.yaml + ``` + +1. Optional: View the resource quota: + + ```console + $ kubectl get -n gpu-operator resourcequota + ``` + + *Example Output* + + ```output + NAME AGE REQUEST + gpu-operator-quota 38s pods: 0/100 + ``` + +1. Install the Google driver installer daemon set. + + For Container-Optimized OS: + + ```console + $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml + ``` + + For Ubuntu, the manifest to apply depends on GPU model and node version. + Refer to the **Ubuntu** tab at + [Manually install NVIDIA GPU drivers](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) + in the GKE documentation. + + > [!NOTE] + > Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +1. 
Install the Operator using Helm: + + ```console + $ helm install --wait --generate-name \ + -n gpu-operator \ + nvidia/gpu-operator \ + --version= \ + --set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \ + --set toolkit.installDir=/home/kubernetes/bin/nvidia \ + --set cdi.enabled=true \ + --set cdi.default=true \ + --set driver.enabled=false + ``` + + Set the NVIDIA Container Toolkit and driver installation path to `/home/kubernetes/bin/nvidia`. + On GKE node images, this directory is writable and is a stateful location for storing the NVIDIA runtime binaries. + + To configure MIG with NVIDIA MIG Manager, specify the following additional Helm command arguments: + + ```console + --set migManager.env[0].name=WITH_REBOOT \ + --set-string migManager.env[0].value=true + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-google/references/nvidia-driver-manager.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/references/nvidia-driver-manager.md new file mode 100644 index 000000000..0c60c9c93 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-google/references/nvidia-driver-manager.md @@ -0,0 +1,112 @@ + + + +# Using NVIDIA Driver Manager + +Perform the following steps to create a GKE cluster with the `gcloud` CLI and use the Operator and NVIDIA Driver Manager to manage the GPU driver. +The steps create the cluster with a node pool that uses a Ubuntu and containerd node image. + +1. 
Create the cluster by running a command that is similar to the following example: + + ```console + $ gcloud beta container clusters create demo-cluster \ + --project \ + --location us-west1 \ + --release-channel "regular" \ + --machine-type "n1-standard-4" \ + --accelerator "type=nvidia-tesla-t4,count=1" \ + --image-type "UBUNTU_CONTAINERD" \ + --node-labels="gke-no-default-nvidia-gpu-device-plugin=true" \ + --disk-type "pd-standard" \ + --disk-size "1000" \ + --no-enable-intra-node-visibility \ + --metadata disable-legacy-endpoints=true \ + --max-pods-per-node "110" \ + --num-nodes "1" \ + --logging=SYSTEM,WORKLOAD \ + --monitoring=SYSTEM \ + --enable-ip-alias \ + --default-max-pods-per-node "110" \ + --no-enable-master-authorized-networks \ + --tags=nvidia-ingress-all + ``` + + Creating the cluster requires several minutes. + +1. Get the authentication credentials for the cluster: + + ```console + $ USE_GKE_GCLOUD_AUTH_PLUGIN=True \ + gcloud container clusters get-credentials demo-cluster --zone us-west1 + ``` + +1. Optional: Verify that you can connect to the cluster: + + ```console + $ kubectl get nodes -o wide + ``` + +1. Create the namespace for the NVIDIA GPU Operator: + + ```console + $ kubectl create ns gpu-operator + ``` + +1. Create a file, such as `gpu-operator-quota.yaml`, with contents like the following example: + + ```yaml + apiVersion: v1 + kind: ResourceQuota + metadata: + name: gpu-operator-quota + spec: + hard: + pods: 100 + scopeSelector: + matchExpressions: + - operator: In + scopeName: PriorityClass + values: + - system-node-critical + - system-cluster-critical + ``` + +1. Apply the resource quota: + + ```console + $ kubectl apply -n gpu-operator -f gpu-operator-quota.yaml + ``` + +1. 
Optional: View the resource quota: + + ```console + $ kubectl get -n gpu-operator resourcequota + ``` + + *Example Output* + + ```output + NAME AGE REQUEST + gke-resource-quotas 6m56s count/ingresses.extensions: 0/100, count/ingresses.networking.k8s.io: 0/100, count/jobs.batch: 0/5k, pods: 2/1500, services: 1/500 + gpu-operator-quota 38s pods: 0/100 + ``` + +1. Install the Operator (use the `gpu-operator-install` skill). + +## Verification + +After installing the Operator, confirm that the GPU nodes are managed and operands are healthy: + +1. Confirm the GPU nodes advertise GPU capacity: + + ```console + $ kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"' + ``` + +1. Confirm the GPU Operator pods are running: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + The `nvidia-operator-validator` pod should report `Completed`. From f337e7a157203b4918716e9ababe51167aba318e Mon Sep 17 00:00:00 2001 From: Andrew Chen Date: Mon, 1 Jun 2026 21:57:47 -0700 Subject: [PATCH 12/13] refactor(skills): info-hiding restructure batch 4 (driver-upgrades, upgrading-nvidia) Restructure 2 procedural GPU Operator skills to the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines) with all step-by-step detail moved into phase-specific references/*.md. Skills: driver-upgrades (concepts, upgrade-controller, without-controller), upgrading-nvidia (helm-upgrade, other-updates). While moving content, repaired pre-existing un-fenced bash blocks in the driver-upgrades troubleshooting steps (wrapped bare '$ kubectl ...' lines in console code fences). All other content preserved verbatim, including the upgrade-controller state-machine graphics reference and placeholders. Both pass the close-your-eyes test. 
Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../gpu-operator-driver-upgrades/SKILL.md | 324 ++---------------- .../references/concepts.md | 19 + .../references/upgrade-controller.md | 238 +++++++++++++ .../references/without-controller.md | 61 ++++ .../gpu-operator-upgrading-nvidia/SKILL.md | 197 ++--------- .../references/helm-upgrade.md | 151 ++++++++ .../references/other-updates.md | 35 ++ 7 files changed, 561 insertions(+), 464 deletions(-) create mode 100644 gpu-operator/.agents/skills/gpu-operator-driver-upgrades/references/concepts.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-driver-upgrades/references/upgrade-controller.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-driver-upgrades/references/without-controller.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/references/helm-upgrade.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/references/other-updates.md diff --git a/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/SKILL.md index 3b5115e6b..1fd9555b3 100644 --- a/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/SKILL.md @@ -20,309 +20,47 @@ tags: # GPU Driver Upgrades +Manage upgrades of the containerized NVIDIA driver. Because the driver kernel +modules must be unloaded and reloaded on each restart, the Operator automates +the disable-clients / unload / restart-pod / install / re-enable sequence. Two +mechanisms are available: the recommended **upgrade controller** (default-on, +observable, with pause/skip and per-node state) and the legacy +**`k8s-driver-manager`** init container. + ## Prerequisites - A running Kubernetes cluster with NVIDIA GPU worker nodes. - The NVIDIA GPU Operator installed (use the `gpu-operator-install` skill). 
- The driver deployed as a container by the Operator (`driver.enabled=true`, the default). The GPU Operator only manages the lifecycle of containerized drivers; drivers pre-installed on the host are not managed by the Operator. -## About Upgrading the GPU Driver - -The NVIDIA driver daemon set requires special consideration for upgrades because the driver kernel modules must be unloaded and loaded again on each driver container restart. -Consequently, the following steps must occur across a driver upgrade: - -1. Disable all clients to the GPU driver. -1. Unload the current GPU driver kernel modules. -1. Start the updated GPU driver pod. -1. Install the updated GPU driver and load the updated kernel modules. -1. Enable the clients of the GPU driver. - -The GPU Operator supports several methods for managing and automating this driver upgrade process. - -> [!NOTE] -> The GPU Operator only manages the lifecycle of containerized drivers. -> Drivers which are pre-installed on the host are not managed by the GPU Operator. - -## Upgrades with the Upgrade Controller - -NVIDIA recommends upgrading by using the upgrade controller and the controller is enabled by default in the GPU Operator. -The controller automates the upgrade process and generates metrics and events so that you can monitor the upgrade process. - -### Procedure -1. Upgrade the driver by changing the `driver.version` value in the cluster policy: - - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ - --type='json' \ - -p='[{"op": "replace", "path": "/spec/driver/version", "value":"580.95.05"}]' - ``` - - If you are using Openshift, you must update the `driver.version`, `driver.repository` and `driver.image` values in the cluster policy. 
- - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ - --type='json' \ - -p='[{"op": "replace", "path": "/spec/driver/version", "value":"580.95.05"},{"op": "replace", "path": "/spec/driver/repository", "value":"nvcr.io/nvidia"},{"op": "replace", "path": "/spec/driver/image", "value":"driver"}]' - ``` - -2. (Optional) For each node, monitor the upgrade status: - - ```console - $ kubectl get node -l nvidia.com/gpu.present \ - -ojsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}{"\n"}{end}' - ``` - - *Example Output* - - ```output - k8s-node-1 upgrade-required - k8s-node-2 upgrade-required - k8s-node-3 upgrade-required - ``` - - You can periodically poll the upgrade status by running the preceding command. - The GPU driver upgrade is complete when the output shows `upgrade-done`: - - ```output - k8s-node-1 upgrade-done - k8s-node-2 upgrade-done - k8s-node-3 upgrade-done - ``` - -### Configuration Options - -You can set the following fields in the cluster policy to configure the upgrade controller: - -```yaml -driver: - - upgradePolicy: - # autoUpgrade (default=true): Switch which enables / disables the driver upgrade controller. - # If set to false all other options are ignored. - autoUpgrade: true - # maxParallelUpgrades (default=1): Number of nodes that can be upgraded in parallel. 0 means infinite. - maxParallelUpgrades: 1 - # maximum number of nodes with the driver installed, that can be unavailable during - # the upgrade. Value can be an absolute number (ex: 5) or - # a percentage of total nodes at the start of upgrade (ex: - # 10%). Absolute number is calculated from percentage by rounding - # up. By default, a fixed value of 25% is used.' - maxUnavailable: 25% - # waitForCompletion: Options for the 'wait-for-completion' state, which will wait for a user-defined group of pods - # to complete before upgrading the driver on a node. 
- waitForCompletion: - # timeoutSeconds (default=0): The length of time to wait before giving up. 0 means infinite. - timeoutSeconds: 0 - # podSelector (default=""): The label selector defining the group of pods to wait for completion of. "" means to wait on none. - podSelector: "" - - # gpuPodDeletion: Options for the 'pod-deletion' state, which will evict all pods on the node allocated a GPU. - gpuPodDeletion: - # force (default=false): Delete pods even if they are not managed by a controller (for example ReplicationController, ReplicaSet, - # Job, DaemonSet or StatefulSet). - force: false - # timeoutSeconds (default=300): The length of time to wait before giving up. 0 means infinite. When the timeout is met, - # the GPU pod(s) will be forcefully deleted. - timeoutSeconds: 300 - # deleteEmptyDir (default=false): Delete pods even if they are using emptyDir volumes (local data will be deleted). - deleteEmptyDir: false - - # drain: Options for the 'drain' state, which invokes 'kubectl drain' on the node. - # Unlike 'gpuPodDeletion', which targets only GPU-allocated pods, drain evicts all pods on the node. - # This should only be enabled as a fallback when 'gpuPodDeletion' cannot remove all GPU-using pods on its own. - drain: - # enable (default=false): Set to true to allow node drain as a fallback when - # 'gpuPodDeletion' cannot evict all GPU pods. By default, drain evicts all pods - # on the node. Use podSelector to limit which pods are evicted. - enable: false - # force (default=false): Delete pods even if they are not managed by a controller - # (for example, ReplicationController, ReplicaSet, Job, DaemonSet, or StatefulSet). - # Applies to all pods on the node, not just GPU pods. - force: false - # podSelector (default=""): Label selector to restrict which pods are evicted - # during drain. An empty string means all pods on the node are evicted. - podSelector: "" - # timeoutSeconds (default=300): The length of time to wait before giving up. 
- # 0 means infinite. When the timeout is reached, the drain attempt is abandoned. - timeoutSeconds: 300 - # deleteEmptyDir (default=false): Allow eviction of pods that use emptyDir volumes. - # Enabling this results in permanent loss of any data stored in those volumes. - deleteEmptyDir: false -``` - -> [!WARNING] -> `driver.upgradePolicy.drain.enable` is a cluster-wide policy setting. -> When set to `true`, the upgrade controller drains each node before upgrading the driver on that node. -> Draining a node evicts all pods from that node, including workloads unrelated to the GPU driver. -> This is a disruptive operation that interrupts running GPU and non-GPU workloads on every node the upgrade controller processes. - -Enable `drain` only when `gpuPodDeletion` is insufficient to remove all GPU-using pods on its own. -Adjust the `gpuPodDeletion` settings first and use `drain` only if those settings do not work. -If you must enable `drain`, use `podSelector` to limit which pods are evicted. -If you specify a value for `maxUnavailable` and also specify `maxParallelUpgrades`, -the `maxUnavailable` value applies an additional constraint on the value of -`maxParallelUpgrades` to ensure that the number of parallel upgrades does not -cause more than the intended number of nodes to become unavailable during the upgrade. -For example, if you specify `maxUnavailable=100%` and `maxParallelUpgrades=1`, -one node is upgraded at a time . - -The `maxUnavailable` value also applies to the currently unavailable nodes in the cluster. -If you cordoned nodes in the cluster and the `maxUnavailable` value is already met by the number of cordoned nodes, -then the upgrade does not progress. - -### Upgrade State Machine - -The upgrade controller manages driver upgrades through a well-defined state machine. -The node label, `nvidia.com/gpu-driver-upgrade-state`, indicates the state a node is currently in. 
-The set of possible states are: - -* Unknown (empty): The upgrade controller is disabled or the node has not been processed yet. -* `upgrade-required`: NVIDIA driver pod is not up-to-date and requires an upgrade. No actions are performed at this stage. -* `cordon-required`: Node will be marked Unschedulable in preparation for the driver upgrade. -* `wait-for-jobs-required`: Node will wait on the completion of a group of pods/jobs before proceeding. -* `pod-deletion-required`: Pods allocated with GPUs are deleted from the node. If pod deletion fails, the node state is set to `drain-required` - if drain is enabled in ClusterPolicy. -* `drain-required`: Node is drained using `kubectl drain`, which evicts all pods on the - node. - This state is only reached if `gpuPodDeletion` fails to remove all - GPU-using pods and `drain.enable` is set to `true` in the cluster policy. - This state is skipped if all GPU pods are successfully deleted from the node. -* `pod-restart-required`: The NVIDIA driver pod running on the node will be restarted and upgraded to the new version. -* `validation-required`: Validation of the new driver deployed on the node is required before proceeding. The GPU Operator - performs validations in the pod named `operator-validator`. -* `uncordon-required`: Node will be marked Schedulable to complete the upgrade process. -* `upgrade-done`: NVIDIA driver pod is up-to-date and running on the node. -* `upgrade-failed`: A failure occurred during the driver upgrade. - -The complete state machine is depicted in the diagram below. - -![](graphics/upgrade-controller-state-machine.png) -### Pausing Driver Upgrades - -To pause the automatic driver upgrade process in the cluster, toggle `driver.upgradePolicy.autoUpgrade` flag -in the cluster policy. -The entire state machine pauses and effectively disables any pending nodes from being upgraded. -You can toggle the flag to `true` again to re-enable the upgrade controller and resume any pending upgrades. 
- -### Skipping Driver Upgrades - -To skip driver upgrades on a certain node, label the node with `nvidia.com/gpu-driver-upgrade.skip=true`. - -### Metrics and Events - -The GPU Operator generates the following metrics during the upgrade process which can be scraped by Prometheus. - -* `gpu_operator_auto_upgrade_enabled`: 1 if driver auto upgrade is enabled; 0 if not. -* `gpu_operator_nodes_upgrades_in_progress`: Total number of nodes in which a driver pod is being upgraded on. -* `gpu_operator_nodes_upgrades_done`: Total number of nodes in which a driver pod has been successfully upgraded. -* `gpu_operator_nodes_upgrades_failed`: Total number of nodes in which a driver pod upgrade has failed. -* `gpu_operator_nodes_upgrades_available`: Total number of nodes in which a driver pod upgrade can start on. -* `gpu_operator_nodes_upgrades_pending`: Total number of nodes in which driver pod upgrades are pending. - -The GPU Operator generates events during the upgrade process. -The most common events are for state transitions or failures at a particular state. -Below are an example set of events generated for the upgrade of one node. 
- -```console -$ kubectl get events -n default --sort-by='.lastTimestamp' | grep GPUDriverUpgrade -``` - -*Example Output* - -```output -10m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [upgrade-required] -10m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [cordon-required] -10m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [wait-for-jobs-required] -10m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [pod-deletion-required] -10m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [pod-restart-required] -7m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [validation-required] -6m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [uncordon-required] -6m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [upgrade-done] -``` - -### Troubleshooting - -If the upgrade fails for a particular node, the node is labelled with the `upgrade-failed` state. - -1. View the upgrade state labels: - - ```console - $ kubectl get node -l nvidia.com/gpu.present \ - -ojsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}{"\n"}{end}' - ``` - - *Example Output* - - ```output - k8s-node-1 upgrade-done - k8s-node-2 upgrade-done - k8s-node-3 upgrade-failed - ``` - -1. Check the events to determine the stage that the upgrade failed: - - $ kubectl get events -n default --sort-by='.lastTimestamp' | grep GPUDriverUpgrade -1. (Optional) Check the logs from the upgrade controller in the gpu-operator container: - - $ kubectl logs -n gpu-operator gpu-operator-xxxxx | grep controllers.Upgrade -1. 
After resolving the upgrade failures for a particular node, you can restart the upgrade process on the node by placing it in the `upgrade-required` state: - - $ kubectl label node nvidia.com/gpu-driver-upgrade-state=upgrade-required --overwrite - -## Upgrades without the Upgrade Controller - -If the upgrade controller is disabled or not supported for your GPU Operator version, a component called `k8s-driver-manager` is responsible -for executing the driver upgrade process. -The `k8s-driver-manager` is an `initContainer` within the driver Daemonset, which ensures all existing GPU driver clients are disabled before -unloading the current driver modules and continuing with the new driver installation. -This method still automates the core driver upgrade process, but lacks the observability that the upgrade controller provides as well as additional -controls such as pausing/skipping upgrades. -In addition, no new features will be added to the `k8s-driver-manager` moving forward in favor of the upgrade controller. - -### Procedure -1. Upgrade the driver by changing `driver.version` value in ClusterPolicy: - - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/driver/version", "value":"580.95.05"},{"op": "replace", "path": "/spec/driver/repository", "value":"nvcr.io/nvidia"},{"op": "replace", "path": "/spec/driver/image", "value":"driver"}]' - ``` +## Activation -2. (Optional) To monitor the status of the upgrade, watch the deployment of the new driver pod on GPU worker nodes: +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All `kubectl patch` cluster-policy commands, `upgradePolicy` / +`driver.manager` configuration blocks, the state-machine reference, metrics, and +troubleshooting commands live only in those reference files — do not improvise +commands from this dispatch layer. 
- ```console - $ kubectl get pods -n gpu-operator -lapp=nvidia-driver-daemonset -w - ``` +## Phases -### Configuration Options +| Phase | Summary | Reference | +|-------|---------|-----------| +| Concepts | Why driver upgrades need special handling (kernel-module unload/reload) and the five-step upgrade sequence the Operator automates; containerized-only scope. | [references/concepts.md](references/concepts.md) | +| Upgrade controller (recommended) | Patch `driver.version` (plus repo/image on OpenShift), monitor per-node `gpu-driver-upgrade-state`, the full `upgradePolicy` config (maxParallel/maxUnavailable/gpuPodDeletion/drain), the upgrade state machine, pausing/skipping, Prometheus metrics, and troubleshooting. | [references/upgrade-controller.md](references/upgrade-controller.md) | +| Without the upgrade controller | The legacy `k8s-driver-manager` init-container path: patch `driver.version`, watch the daemonset rollout, and the `driver.manager` env configuration (GPU-pod-eviction/auto-drain/OnDelete strategy). | [references/without-controller.md](references/without-controller.md) | -The following configuration options are available for `k8s-driver-manager`. The options allow users to control the -GPU pod eviction and node drain behavior. +## Hard rules (apply across all phases) -```yaml -driver: - manager: - env: - - name: ENABLE_GPU_POD_EVICTION - value: "true" - - name: ENABLE_AUTO_DRAIN - value: "true" - - name: DRAIN_USE_FORCE - value: "false" - - name: DRAIN_POD_SELECTOR_LABEL - value: "" - - name: DRAIN_TIMEOUT_SECONDS - value: "0s" - - name: DRAIN_DELETE_EMPTYDIR_DATA - value: "false" -``` +- The GPU Operator only manages containerized drivers; host-preinstalled drivers are not upgraded by the Operator. +- The upgrade controller is the recommended path and is enabled by default; no new features are being added to `k8s-driver-manager`. 
+- `driver.upgradePolicy.drain.enable=true` is cluster-wide and disruptive (evicts all pods, including non-GPU workloads); enable only as a fallback when `gpuPodDeletion` cannot remove all GPU pods, and scope it with `podSelector`. +- On OpenShift, patch `driver.version`, `driver.repository`, and `driver.image` together. +- Driver version values shown (e.g., `580.95.05`) are illustrative; use the version appropriate to your deployment. -* The `ENABLE_GPU_POD_EVICTION` environment variable enables `k8s-driver-manager` to attempt evicting only GPU pods from the node before attempting a node drain. Only if this fails and - `ENABLE_AUTO_DRAIN` is enabled will the node ever be drained. -* The `DRAIN_USE_FORCE` environment variable must be enabled to evict GPU pods that are not managed by any of the replication controllers such as deployment, daemon set, stateful set, and replica set. -* The `DRAIN_DELETE_EMPTYDIR_DATA` environment variable must be enabled to delete GPU pods that use the `emptyDir` type volume. +## Verification -> [!NOTE] -> Since GPU pods get evicted whenever the NVIDIA Driver daemon set specification is updated, it might not always be desirable to allow this to happen automatically. -> To prevent this `daemonsets.updateStrategy` parameter in the `ClusterPolicy` can be set to [OnDelete](https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#daemonset-update-strategy) . -> With `OnDelete` update strategy, a new driver pod with the updated spec will only get deployed on a node once the old driver pod is manually deleted. -> Thus, admins can control when to rollout spec updates to driver pods on any given node. -> For more information on DaemonSet update strategies, refer to the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#daemonset-update-strategy). +Poll the per-node `nvidia.com/gpu-driver-upgrade-state` label until every GPU +node reports `upgrade-done`. 
The exact `kubectl` command and the +`upgrade-failed` troubleshooting flow are in +[references/upgrade-controller.md](references/upgrade-controller.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/references/concepts.md b/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/references/concepts.md new file mode 100644 index 000000000..73e35675f --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/references/concepts.md @@ -0,0 +1,19 @@ + + + +# About Upgrading the GPU Driver + +The NVIDIA driver daemon set requires special consideration for upgrades because the driver kernel modules must be unloaded and loaded again on each driver container restart. +Consequently, the following steps must occur across a driver upgrade: + +1. Disable all clients to the GPU driver. +1. Unload the current GPU driver kernel modules. +1. Start the updated GPU driver pod. +1. Install the updated GPU driver and load the updated kernel modules. +1. Enable the clients of the GPU driver. + +The GPU Operator supports several methods for managing and automating this driver upgrade process. + +> [!NOTE] +> The GPU Operator only manages the lifecycle of containerized drivers. +> Drivers which are pre-installed on the host are not managed by the GPU Operator. diff --git a/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/references/upgrade-controller.md b/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/references/upgrade-controller.md new file mode 100644 index 000000000..7684b154f --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/references/upgrade-controller.md @@ -0,0 +1,238 @@ + + + +# Upgrades with the Upgrade Controller + +NVIDIA recommends upgrading by using the upgrade controller and the controller is enabled by default in the GPU Operator. +The controller automates the upgrade process and generates metrics and events so that you can monitor the upgrade process. + +## Procedure + +1. 
Upgrade the driver by changing the `driver.version` value in the cluster policy: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ + --type='json' \ + -p='[{"op": "replace", "path": "/spec/driver/version", "value":"580.95.05"}]' + ``` + + If you are using OpenShift, you must update the `driver.version`, `driver.repository` and `driver.image` values in the cluster policy. + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy \ + --type='json' \ + -p='[{"op": "replace", "path": "/spec/driver/version", "value":"580.95.05"},{"op": "replace", "path": "/spec/driver/repository", "value":"nvcr.io/nvidia"},{"op": "replace", "path": "/spec/driver/image", "value":"driver"}]' + ``` + +2. (Optional) For each node, monitor the upgrade status: + + ```console + $ kubectl get node -l nvidia.com/gpu.present \ + -ojsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}{"\n"}{end}' + ``` + + *Example Output* + + ```output + k8s-node-1 upgrade-required + k8s-node-2 upgrade-required + k8s-node-3 upgrade-required + ``` + + You can periodically poll the upgrade status by running the preceding command. + The GPU driver upgrade is complete when the output shows `upgrade-done`: + + ```output + k8s-node-1 upgrade-done + k8s-node-2 upgrade-done + k8s-node-3 upgrade-done + ``` + +## Configuration Options + +You can set the following fields in the cluster policy to configure the upgrade controller: + +```yaml +driver: + + upgradePolicy: + # autoUpgrade (default=true): Switch which enables / disables the driver upgrade controller. + # If set to false all other options are ignored. + autoUpgrade: true + # maxParallelUpgrades (default=1): Number of nodes that can be upgraded in parallel. 0 means infinite. + maxParallelUpgrades: 1 + # maximum number of nodes with the driver installed, that can be unavailable during + # the upgrade. 
Value can be an absolute number (ex: 5) or + # a percentage of total nodes at the start of upgrade (ex: + # 10%). Absolute number is calculated from percentage by rounding + # up. By default, a fixed value of 25% is used.' + maxUnavailable: 25% + # waitForCompletion: Options for the 'wait-for-completion' state, which will wait for a user-defined group of pods + # to complete before upgrading the driver on a node. + waitForCompletion: + # timeoutSeconds (default=0): The length of time to wait before giving up. 0 means infinite. + timeoutSeconds: 0 + # podSelector (default=""): The label selector defining the group of pods to wait for completion of. "" means to wait on none. + podSelector: "" + + # gpuPodDeletion: Options for the 'pod-deletion' state, which will evict all pods on the node allocated a GPU. + gpuPodDeletion: + # force (default=false): Delete pods even if they are not managed by a controller (for example ReplicationController, ReplicaSet, + # Job, DaemonSet or StatefulSet). + force: false + # timeoutSeconds (default=300): The length of time to wait before giving up. 0 means infinite. When the timeout is met, + # the GPU pod(s) will be forcefully deleted. + timeoutSeconds: 300 + # deleteEmptyDir (default=false): Delete pods even if they are using emptyDir volumes (local data will be deleted). + deleteEmptyDir: false + + # drain: Options for the 'drain' state, which invokes 'kubectl drain' on the node. + # Unlike 'gpuPodDeletion', which targets only GPU-allocated pods, drain evicts all pods on the node. + # This should only be enabled as a fallback when 'gpuPodDeletion' cannot remove all GPU-using pods on its own. + drain: + # enable (default=false): Set to true to allow node drain as a fallback when + # 'gpuPodDeletion' cannot evict all GPU pods. By default, drain evicts all pods + # on the node. Use podSelector to limit which pods are evicted. 
+ enable: false + # force (default=false): Delete pods even if they are not managed by a controller + # (for example, ReplicationController, ReplicaSet, Job, DaemonSet, or StatefulSet). + # Applies to all pods on the node, not just GPU pods. + force: false + # podSelector (default=""): Label selector to restrict which pods are evicted + # during drain. An empty string means all pods on the node are evicted. + podSelector: "" + # timeoutSeconds (default=300): The length of time to wait before giving up. + # 0 means infinite. When the timeout is reached, the drain attempt is abandoned. + timeoutSeconds: 300 + # deleteEmptyDir (default=false): Allow eviction of pods that use emptyDir volumes. + # Enabling this results in permanent loss of any data stored in those volumes. + deleteEmptyDir: false +``` + +> [!WARNING] +> `driver.upgradePolicy.drain.enable` is a cluster-wide policy setting. +> When set to `true`, the upgrade controller drains each node before upgrading the driver on that node. +> Draining a node evicts all pods from that node, including workloads unrelated to the GPU driver. +> This is a disruptive operation that interrupts running GPU and non-GPU workloads on every node the upgrade controller processes. + +Enable `drain` only when `gpuPodDeletion` is insufficient to remove all GPU-using pods on its own. +Adjust the `gpuPodDeletion` settings first and use `drain` only if those settings do not work. +If you must enable `drain`, use `podSelector` to limit which pods are evicted. +If you specify a value for `maxUnavailable` and also specify `maxParallelUpgrades`, +the `maxUnavailable` value applies an additional constraint on the value of +`maxParallelUpgrades` to ensure that the number of parallel upgrades does not +cause more than the intended number of nodes to become unavailable during the upgrade. +For example, if you specify `maxUnavailable=100%` and `maxParallelUpgrades=1`, +one node is upgraded at a time. 
+ +The `maxUnavailable` value also applies to the currently unavailable nodes in the cluster. +If you cordoned nodes in the cluster and the `maxUnavailable` value is already met by the number of cordoned nodes, +then the upgrade does not progress. + +## Upgrade State Machine + +The upgrade controller manages driver upgrades through a well-defined state machine. +The node label, `nvidia.com/gpu-driver-upgrade-state`, indicates the state a node is currently in. +The possible states are: + +* Unknown (empty): The upgrade controller is disabled or the node has not been processed yet. +* `upgrade-required`: NVIDIA driver pod is not up-to-date and requires an upgrade. No actions are performed at this stage. +* `cordon-required`: Node will be marked Unschedulable in preparation for the driver upgrade. +* `wait-for-jobs-required`: Node will wait on the completion of a group of pods/jobs before proceeding. +* `pod-deletion-required`: Pods allocated with GPUs are deleted from the node. If pod deletion fails, the node state is set to `drain-required` + if drain is enabled in ClusterPolicy. +* `drain-required`: Node is drained using `kubectl drain`, which evicts all pods on the + node. + This state is only reached if `gpuPodDeletion` fails to remove all + GPU-using pods and `drain.enable` is set to `true` in the cluster policy. + This state is skipped if all GPU pods are successfully deleted from the node. +* `pod-restart-required`: The NVIDIA driver pod running on the node will be restarted and upgraded to the new version. +* `validation-required`: Validation of the new driver deployed on the node is required before proceeding. The GPU Operator + performs validations in the pod named `operator-validator`. +* `uncordon-required`: Node will be marked Schedulable to complete the upgrade process. +* `upgrade-done`: NVIDIA driver pod is up-to-date and running on the node. +* `upgrade-failed`: A failure occurred during the driver upgrade. 
+ +The complete state machine is depicted in the diagram below. + +![](graphics/upgrade-controller-state-machine.png) + +## Pausing Driver Upgrades + +To pause the automatic driver upgrade process in the cluster, set the `driver.upgradePolicy.autoUpgrade` flag +in the cluster policy to `false`. +The entire state machine pauses, which prevents any pending nodes from being upgraded. +You can set the flag to `true` again to re-enable the upgrade controller and resume any pending upgrades. + +## Skipping Driver Upgrades + +To skip driver upgrades on a certain node, label the node with `nvidia.com/gpu-driver-upgrade.skip=true`. + +## Metrics and Events + +The GPU Operator generates the following metrics during the upgrade process, which can be scraped by Prometheus. + +* `gpu_operator_auto_upgrade_enabled`: 1 if driver auto upgrade is enabled; 0 if not. +* `gpu_operator_nodes_upgrades_in_progress`: Total number of nodes on which a driver pod is being upgraded. +* `gpu_operator_nodes_upgrades_done`: Total number of nodes on which a driver pod has been successfully upgraded. +* `gpu_operator_nodes_upgrades_failed`: Total number of nodes on which a driver pod upgrade has failed. +* `gpu_operator_nodes_upgrades_available`: Total number of nodes on which a driver pod upgrade can start. +* `gpu_operator_nodes_upgrades_pending`: Total number of nodes on which driver pod upgrades are pending. + +The GPU Operator generates events during the upgrade process. +The most common events are for state transitions or failures at a particular state. +The following is an example set of events generated for the upgrade of one node. 
+ +```console +$ kubectl get events -n default --sort-by='.lastTimestamp' | grep GPUDriverUpgrade +``` + +*Example Output* + +```output +10m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [upgrade-required] +10m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [cordon-required] +10m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [wait-for-jobs-required] +10m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [pod-deletion-required] +10m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [pod-restart-required] +7m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [validation-required] +6m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [uncordon-required] +6m Normal GPUDriverUpgrade node/localhost.localdomain Successfully updated node state label to [upgrade-done] +``` + +## Troubleshooting + +If the upgrade fails for a particular node, the node is labelled with the `upgrade-failed` state. + +1. View the upgrade state labels: + + ```console + $ kubectl get node -l nvidia.com/gpu.present \ + -ojsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}{"\n"}{end}' + ``` + + *Example Output* + + ```output + k8s-node-1 upgrade-done + k8s-node-2 upgrade-done + k8s-node-3 upgrade-failed + ``` + +1. Check the events to determine the stage that the upgrade failed: + + ```console + $ kubectl get events -n default --sort-by='.lastTimestamp' | grep GPUDriverUpgrade + ``` + +1. (Optional) Check the logs from the upgrade controller in the gpu-operator container: + + ```console + $ kubectl logs -n gpu-operator gpu-operator-xxxxx | grep controllers.Upgrade + ``` + +1. 
After resolving the upgrade failures for a particular node, you can restart the upgrade process on the node by placing it in the `upgrade-required` state: + + ```console + $ kubectl label node nvidia.com/gpu-driver-upgrade-state=upgrade-required --overwrite + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/references/without-controller.md b/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/references/without-controller.md new file mode 100644 index 000000000..4d935c758 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-driver-upgrades/references/without-controller.md @@ -0,0 +1,61 @@ + + + +# Upgrades without the Upgrade Controller + +If the upgrade controller is disabled or not supported for your GPU Operator version, a component called `k8s-driver-manager` is responsible +for executing the driver upgrade process. +The `k8s-driver-manager` is an `initContainer` within the driver Daemonset, which ensures all existing GPU driver clients are disabled before +unloading the current driver modules and continuing with the new driver installation. +This method still automates the core driver upgrade process, but lacks the observability that the upgrade controller provides as well as additional +controls such as pausing/skipping upgrades. +In addition, no new features will be added to the `k8s-driver-manager` moving forward in favor of the upgrade controller. + +## Procedure + +1. Upgrade the driver by changing `driver.version` value in ClusterPolicy: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/driver/version", "value":"580.95.05"},{"op": "replace", "path": "/spec/driver/repository", "value":"nvcr.io/nvidia"},{"op": "replace", "path": "/spec/driver/image", "value":"driver"}]' + ``` + +2. 
(Optional) To monitor the status of the upgrade, watch the deployment of the new driver pod on GPU worker nodes: + + ```console + $ kubectl get pods -n gpu-operator -lapp=nvidia-driver-daemonset -w + ``` + +## Configuration Options + +The following configuration options are available for `k8s-driver-manager`. The options allow users to control the +GPU pod eviction and node drain behavior. + +```yaml +driver: + manager: + env: + - name: ENABLE_GPU_POD_EVICTION + value: "true" + - name: ENABLE_AUTO_DRAIN + value: "true" + - name: DRAIN_USE_FORCE + value: "false" + - name: DRAIN_POD_SELECTOR_LABEL + value: "" + - name: DRAIN_TIMEOUT_SECONDS + value: "0s" + - name: DRAIN_DELETE_EMPTYDIR_DATA + value: "false" +``` + +* The `ENABLE_GPU_POD_EVICTION` environment variable enables `k8s-driver-manager` to attempt evicting only GPU pods from the node before attempting a node drain. Only if this fails and + `ENABLE_AUTO_DRAIN` is enabled will the node ever be drained. +* The `DRAIN_USE_FORCE` environment variable must be enabled to evict GPU pods that are not managed by any of the replication controllers such as deployment, daemon set, stateful set, and replica set. +* The `DRAIN_DELETE_EMPTYDIR_DATA` environment variable must be enabled to delete GPU pods that use the `emptyDir` type volume. + +> [!NOTE] +> Since GPU pods get evicted whenever the NVIDIA Driver daemon set specification is updated, it might not always be desirable to allow this to happen automatically. +> To prevent this `daemonsets.updateStrategy` parameter in the `ClusterPolicy` can be set to [OnDelete](https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#daemonset-update-strategy) . +> With `OnDelete` update strategy, a new driver pod with the updated spec will only get deployed on a node once the old driver pod is manually deleted. +> Thus, admins can control when to rollout spec updates to driver pods on any given node. 
+> For more information on DaemonSet update strategies, refer to the [Kubernetes documentation](https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#daemonset-update-strategy). diff --git a/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md index 19425afb6..7a816f7ea 100644 --- a/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/SKILL.md @@ -20,6 +20,12 @@ tags: # Upgrading the NVIDIA GPU Operator +Upgrade an existing GPU Operator installation. Because Helm does not +automatically upgrade existing CRDs, you either upgrade the CRDs manually before +`helm upgrade` or let the default `pre-upgrade` Helm hook do it. The Operator +also supports dynamic `ClusterPolicy` edits, and the driver daemonset has +additional upgrade considerations. + ## Prerequisites - A Kubernetes cluster with an existing NVIDIA GPU Operator installation and the `kubectl` and `helm` CLIs available. @@ -29,182 +35,31 @@ tags: $ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged ``` -## Using Helm - -The GPU Operator supports dynamic updates to existing resources. -This ability enables the GPU Operator to ensure settings from the cluster policy specification are always applied and current. - -Because Helm [does not support](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations) automatic upgrade of existing CRDs, -you can upgrade the GPU Operator chart manually or by enabling a Helm hook. - -### Option 1: Manually Upgrading CRDs - - ```mermaid - flowchart LR - - A["Update CRD from - the latest chart"] - --> - B["Upgrade by - using Helm"] - ``` - -With this procedure, all existing GPU Operator resources are updated inline and the cluster policy resource is patched with updates from `values.yaml`. 
- -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). - -1. Specify the Operator release tag in an environment variable: - - ```console - $ export RELEASE_TAG= - ``` - -1. Apply the custom resource definitions for the cluster policy and NVIDIA driver: - - ```console - $ kubectl apply -f \ - https://raw.githubusercontent.com/NVIDIA/gpu-operator/refs/tags/$RELEASE_TAG/deployments/gpu-operator/crds/nvidia.com_clusterpolicies.yaml - - $ kubectl apply -f \ - https://raw.githubusercontent.com/NVIDIA/gpu-operator/refs/tags/$RELEASE_TAG/deployments/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml - ``` - - *Example Output* - - ```output - customresourcedefinition.apiextensions.k8s.io/clusterpolicies.nvidia.com configured - customresourcedefinition.apiextensions.k8s.io/nvidiadrivers.nvidia.com created - ``` - -1. Apply the custom resource definition for Node Feature Discovery: - - ```console - $ kubectl apply -f \ - https://raw.githubusercontent.com/NVIDIA/gpu-operator/refs/tags/$RELEASE_TAG/deployments/gpu-operator/charts/node-feature-discovery/crds/nfd-api-crds.yaml - ``` - - *Example Output* - - ```output - customresourcedefinition.apiextensions.k8s.io/nodefeaturerules.nfd.k8s-sigs.io configured - ``` - -1. Update the information about the Operator chart: - - ```console - $ helm repo update nvidia - ``` - - *Example Output* - - ```output - Hang tight while we grab the latest from your chart repositories... - ...Successfully got an update from the "nvidia" chart repository - Update Complete. ⎈Happy Helming!⎈ - ``` - -1. Fetch the values from the chart: - - ```console - $ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml - ``` - -1. Update the values file as needed. +## Activation -1. 
Upgrade the Operator: +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All `kubectl`/`helm` command sequences, CRD-apply URLs, and +verification output live only in those reference files — do not improvise +commands from this dispatch layer. - ```console - $ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator -f values-$RELEASE_TAG.yaml --version $RELEASE_TAG - ``` +## Phases - *Example Output* +| Phase | Summary | Reference | +|-------|---------|-----------| +| Helm upgrade | Both CRD-handling options: Option 1 (manually apply the clusterpolicies/nvidiadrivers/NFD CRDs, then `helm upgrade`) and Option 2 (the default `pre-upgrade` Helm hook, using `--disable-openapi-validation`). | [references/helm-upgrade.md](references/helm-upgrade.md) | +| Cluster policy, driver controls, OLM, verification | Dynamic `ClusterPolicy` edits via `kubectl edit`, the pointer to driver-daemonset upgrade considerations, the OpenShift OLM upgrade path, and post-upgrade health verification. | [references/other-updates.md](references/other-updates.md) | - ```output - Release "gpu-operator" has been upgraded. Happy Helming! - NAME: gpu-operator - LAST DEPLOYED: Thu Apr 20 15:05:52 2023 - NAMESPACE: gpu-operator - STATUS: deployed - REVISION: 2 - TEST SUITE: None - ``` +## Hard rules (apply across all phases) -### Option 2: Automatically Upgrading CRDs Using a Helm Hook - -Starting with GPU Operator v22.09, a `pre-upgrade` Helm [hook](https://helm.sh/docs/topics/charts_hooks/#the-available-hooks) can automatically upgrade to latest CRD. - -Starting with GPU Operator v24.9.0, the upgrade CRD Helm hook is enabled by default and runs an upgrade CRD job when you upgrade using Helm. - -1. Specify the Operator release tag in an environment variable: - - ```console - $ export RELEASE_TAG= - ``` - -1. 
Update the information about the Operator chart: - - ```console - $ helm repo update nvidia - ``` - - *Example Output* - - ```output - Hang tight while we grab the latest from your chart repositories... - ...Successfully got an update from the "nvidia" chart repository - Update Complete. ⎈Happy Helming!⎈ - ``` - -1. Fetch the values from the chart: - - ```console - $ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml - ``` - -1. Update the values file as needed. - -1. Upgrade the Operator: - - ```console - $ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \ - --disable-openapi-validation -f values-$RELEASE_TAG.yaml --version $RELEASE_TAG - ``` - - > [!NOTE] - > * Option `--disable-openapi-validation` is required in this case so that Helm will not try to validate if CR instance from the new chart is valid as per old CRD. - > Since CR instance in the Chart is valid for the upgraded CRD, this will be compatible. - - * Helm hooks used with the GPU Operator use the operator image itself. If operator image itself cannot be pulled successfully (either due to network error or an invalid NGC registry secret in case of NVAIE), hooks will fail. - In this case, chart needs to be deleted using `--no-hooks` option to avoid deletion to be hung on hook failures. - -## Cluster Policy Updates - -The GPU Operator also supports dynamic updates to the `ClusterPolicy` CustomResource using `kubectl`: - -```console -$ kubectl edit clusterpolicy -``` - -After the edits are complete, Kubernetes will automatically apply the updates to cluster. - -## Additional Controls for Driver Upgrades - -While most of the GPU Operator managed daemonsets can be upgraded seamlessly, the NVIDIA driver daemonset has special considerations. -Refer to the GPU driver upgrade behavior (use the `gpu-operator-driver-upgrades` skill) for more information. 
- -## Using Operator Lifecycle Manager (OLM) in OpenShift - -For upgrading the GPU Operator when running in OpenShift, refer to the official OpenShift documentation on [upgrading installed operators](https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html/operators/administrator-tasks#olm-upgrading-operators). +- Helm does not auto-upgrade existing CRDs; either apply them manually (Option 1) or rely on the default Helm hook (Option 2, default since v24.9.0). +- Option 2 requires `--disable-openapi-validation` so Helm does not validate the new CR against the old CRD. +- Helm hooks run the Operator image; if it cannot be pulled, delete with `--no-hooks` to avoid a hung deletion. +- The NVIDIA driver daemonset has special upgrade behavior — see the `gpu-operator-driver-upgrades` skill. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. ## Verification -After upgrading, confirm that the Operator and its operands are healthy: - -1. Confirm all GPU Operator pods are running or completed: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - The `nvidia-operator-validator` pod should report `Completed`, and the driver, toolkit, and device-plugin pods should report `Running` on the expected GPU nodes. +After upgrade, confirm `nvidia-operator-validator` is `Completed` and the +driver/toolkit/device-plugin pods are `Running`. Exact commands are in +[references/other-updates.md](references/other-updates.md). 
diff --git a/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/references/helm-upgrade.md b/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/references/helm-upgrade.md new file mode 100644 index 000000000..771c35f3f --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/references/helm-upgrade.md @@ -0,0 +1,151 @@ + + + +# Using Helm + +The GPU Operator supports dynamic updates to existing resources. +This ability enables the GPU Operator to ensure settings from the cluster policy specification are always applied and current. + +Because Helm [does not support](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/#some-caveats-and-explanations) automatic upgrade of existing CRDs, +you can upgrade the GPU Operator chart manually or by enabling a Helm hook. + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +## Option 1: Manually Upgrading CRDs + + ```mermaid + flowchart LR + + A["Update CRD from + the latest chart"] + --> + B["Upgrade by + using Helm"] + ``` + +With this procedure, all existing GPU Operator resources are updated inline and the cluster policy resource is patched with updates from `values.yaml`. + +1. Specify the Operator release tag in an environment variable: + + ```console + $ export RELEASE_TAG= + ``` + +1. 
Apply the custom resource definitions for the cluster policy and NVIDIA driver: + + ```console + $ kubectl apply -f \ + https://raw.githubusercontent.com/NVIDIA/gpu-operator/refs/tags/$RELEASE_TAG/deployments/gpu-operator/crds/nvidia.com_clusterpolicies.yaml + + $ kubectl apply -f \ + https://raw.githubusercontent.com/NVIDIA/gpu-operator/refs/tags/$RELEASE_TAG/deployments/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml + ``` + + *Example Output* + + ```output + customresourcedefinition.apiextensions.k8s.io/clusterpolicies.nvidia.com configured + customresourcedefinition.apiextensions.k8s.io/nvidiadrivers.nvidia.com created + ``` + +1. Apply the custom resource definition for Node Feature Discovery: + + ```console + $ kubectl apply -f \ + https://raw.githubusercontent.com/NVIDIA/gpu-operator/refs/tags/$RELEASE_TAG/deployments/gpu-operator/charts/node-feature-discovery/crds/nfd-api-crds.yaml + ``` + + *Example Output* + + ```output + customresourcedefinition.apiextensions.k8s.io/nodefeaturerules.nfd.k8s-sigs.io configured + ``` + +1. Update the information about the Operator chart: + + ```console + $ helm repo update nvidia + ``` + + *Example Output* + + ```output + Hang tight while we grab the latest from your chart repositories... + ...Successfully got an update from the "nvidia" chart repository + Update Complete. ⎈Happy Helming!⎈ + ``` + +1. Fetch the values from the chart: + + ```console + $ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml + ``` + +1. Update the values file as needed. + +1. Upgrade the Operator: + + ```console + $ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator -f values-$RELEASE_TAG.yaml --version $RELEASE_TAG + ``` + + *Example Output* + + ```output + Release "gpu-operator" has been upgraded. Happy Helming! 
+ NAME: gpu-operator + LAST DEPLOYED: Thu Apr 20 15:05:52 2023 + NAMESPACE: gpu-operator + STATUS: deployed + REVISION: 2 + TEST SUITE: None + ``` + +## Option 2: Automatically Upgrading CRDs Using a Helm Hook + +Starting with GPU Operator v22.09, a `pre-upgrade` Helm [hook](https://helm.sh/docs/topics/charts_hooks/#the-available-hooks) can automatically upgrade the CRDs to the latest version. + +Starting with GPU Operator v24.9.0, the CRD upgrade Helm hook is enabled by default and runs a CRD upgrade job when you upgrade using Helm. + +1. Specify the Operator release tag in an environment variable: + + ```console + $ export RELEASE_TAG= + ``` + +1. Update the information about the Operator chart: + + ```console + $ helm repo update nvidia + ``` + + *Example Output* + + ```output + Hang tight while we grab the latest from your chart repositories... + ...Successfully got an update from the "nvidia" chart repository + Update Complete. ⎈Happy Helming!⎈ + ``` + +1. Fetch the values from the chart: + + ```console + $ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml + ``` + +1. Update the values file as needed. + +1. Upgrade the Operator: + + ```console + $ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \ + --disable-openapi-validation -f values-$RELEASE_TAG.yaml --version $RELEASE_TAG + ``` + + > [!NOTE] + > * The `--disable-openapi-validation` option is required in this case so that Helm does not validate the custom resource (CR) instance from the new chart against the old CRD. + > Because the CR instance in the chart is valid for the upgraded CRD, the upgrade remains compatible. + + * Helm hooks used with the GPU Operator run the Operator image itself. If the Operator image cannot be pulled (because of a network error or, in the case of NVAIE, an invalid NGC registry secret), the hooks fail. + In this case, delete the chart with the `--no-hooks` option so that the deletion does not hang on hook failures.
diff --git a/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/references/other-updates.md b/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/references/other-updates.md new file mode 100644 index 000000000..1bb647d9d --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-upgrading-nvidia/references/other-updates.md @@ -0,0 +1,35 @@ + + + +# Cluster Policy Updates, Driver Controls, OLM, and Verification + +## Cluster Policy Updates + +The GPU Operator also supports dynamic updates to the `ClusterPolicy` custom resource using `kubectl`: + +```console +$ kubectl edit clusterpolicy +``` + +After the edits are complete, Kubernetes automatically applies the updates to the cluster. + +## Additional Controls for Driver Upgrades + +While most of the GPU Operator managed daemonsets can be upgraded seamlessly, the NVIDIA driver daemonset has special considerations. +Refer to the GPU driver upgrade behavior (use the `gpu-operator-driver-upgrades` skill) for more information. + +## Using Operator Lifecycle Manager (OLM) in OpenShift + +To upgrade the GPU Operator on OpenShift, refer to the official OpenShift documentation on [upgrading installed operators](https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html/operators/administrator-tasks#olm-upgrading-operators). + +## Verification + +After upgrading, confirm that the Operator and its operands are healthy: + +1. Confirm all GPU Operator pods are running or completed: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + The `nvidia-operator-validator` pod should report `Completed`, and the driver, toolkit, and device-plugin pods should report `Running` on the expected GPU nodes.
From 67e0a07359143e7fe6c407d2cff748be4baffae7 Mon Sep 17 00:00:00 2001 From: Andrew Chen Date: Mon, 1 Jun 2026 22:03:35 -0700 Subject: [PATCH 13/13] refactor(skills): info-hiding restructure batch 5 (precompiled, vgpu, airgapped, dra) Restructure the 4 remaining (largest) procedural GPU Operator skills to the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines) with all step-by-step detail moved into phase-specific references/*.md. Skills: precompiled-drivers (concepts, availability, enable-disable, build-custom), install-nvidia-vgpu (concepts, build-driver, configure-and-install), install-airgapped-environments (concepts, local-image-registry, local-package-repository, deploy), nvidia-dra (concepts, install-gpu-operator, install-dra-driver, health-checks). All four pass the close-your-eyes test. Verified-fix content preserved (prerequisites, verification, placeholders, EULA/license warnings, cross-refs to install/precompiled skills). This completes the 17-skill remaining set; with the prior 7 done, all 24 procedural gpu-operator skills are now information-hidden. 
Co-Authored-By: Claude Opus 4.8 (1M context) Signed-off-by: Andrew Chen --- .../SKILL.md | 369 ++---------------- .../references/concepts.md | 62 +++ .../references/deploy.md | 30 ++ .../references/local-image-registry.md | 143 +++++++ .../references/local-package-repository.md | 124 ++++++ .../gpu-operator-install-nvidia-vgpu/SKILL.md | 216 ++-------- .../references/build-driver.md | 100 +++++ .../references/concepts.md | 20 + .../references/configure-and-install.md | 88 +++++ .../skills/gpu-operator-nvidia-dra/SKILL.md | 315 ++------------- .../references/concepts.md | 37 ++ .../references/health-checks.md | 51 +++ .../references/install-dra-driver.md | 162 ++++++++ .../references/install-gpu-operator.md | 58 +++ .../gpu-operator-precompiled-drivers/SKILL.md | 295 ++------------ .../references/availability.md | 45 +++ .../references/build-custom.md | 119 ++++++ .../references/concepts.md | 30 ++ .../references/enable-disable.md | 94 +++++ 19 files changed, 1278 insertions(+), 1080 deletions(-) create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/concepts.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/deploy.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/local-image-registry.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/local-package-repository.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/references/build-driver.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/references/concepts.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/references/configure-and-install.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/concepts.md create mode 100644 
gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/health-checks.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/install-dra-driver.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/install-gpu-operator.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/availability.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/build-custom.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/concepts.md create mode 100644 gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/enable-disable.md diff --git a/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md index 0080b381b..b5c8dad1a 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/SKILL.md @@ -21,355 +21,46 @@ tags: # Install NVIDIA GPU Operator in Air-Gapped Environments +Deploy the GPU Operator in clusters with restricted or no internet access. The +Operator normally needs the internet to pull container images and to let the +`driver` container download OS packages. In an air-gapped setup you mirror the +images into a local registry and (unless using precompiled drivers) mirror the +OS packages into a local package repository, then install with a customized +`values.yaml`. + ## Prerequisites - A running Kubernetes cluster with NVIDIA GPU worker nodes that has restricted or no internet access. - A private container registry reachable from the cluster, and a local package repository or HTTP proxy for operating-system packages. - The `kubectl` and `helm` CLIs available on a client machine, plus a workstation with internet access for mirroring images and charts. 
-## About Air-Gapped Installations - -This page describes how to successfully deploy the GPU Operator in clusters with restricted internet access. -By default, The GPU Operator requires internet access for the following reasons: - - 1) Container images need to be pulled during GPU Operator installation. - 2) The `driver` container needs to download several OS packages prior to driver installation. - - > [!TIP] - > Using precompiled-drivers removes the need for the `driver` containers to - > download operating system packages and removes the need to create a local package repository. - > To address these requirements, it may be necessary to create a local image registry and/or a local package repository - > so that the necessary images and packages are available for your cluster. In subsequent sections, we detail how to - > configure the GPU Operator to use local image registries and local package repositories. If your cluster is behind - > a proxy, also follow the steps from install-gpu-operator-proxy. - -Different steps are required for different environments with varying levels of internet connectivity. 
-The supported use cases/environments are listed in the below table: - -+--------------------------+-----------------------------------------+ - Network Flow | -+--------------------------+--------------------+--------------------+ - Use Case Pulling Images Pulling Packages -+========+=================+====================+====================+ - **1** HTTP Proxy with K8s node --> HTTP Driver container | - full Internet Proxy --> Internet --> HTTP Proxy --> | - access Image Registry Internet Package | - Repository | -+--------+-----------------+--------------------+--------------------+ - **2** HTTP Proxy with K8s node --> HTTP Driver container | - limited Internet Proxy --> Internet --> HTTP Proxy --> | - access Image Registry Local Package | - Repository | -+--------+-----------------+--------------------+--------------------+ - **3a** Full Air-Gapped K8s node --> Local Driver container | - (w/ HTTP Proxy) Image Registry --> HTTP Proxy --> | - Local Package | - Repository | -+--------+-----------------+--------------------+--------------------+ - **3b** Full Air-Gapped K8s node --> Local Driver container-->| - (w/o HTTP Proxy) Image Registry Local Package | - Repository | -+--------+-----------------+--------------------+--------------------+ - -> [!NOTE] -> For Red Hat Openshift deployments in air-gapped environments (use cases 2, 3a and 3b), -> refer to [Mirror GPU Operator images for disconnected OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/mirror-gpu-ocp-disconnected.html). -> [!NOTE] -> Ensure that Kubernetes nodes can successfully reach the local DNS server(s). -> Public name resolution for image registry and package repositories are -> mandatory for use cases 1 and 2. -> Before proceeding to the next sections, get the `values.yaml` file used for GPU Operator configuration. 
- -```console -$ curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator/v1.7.0/deployments/gpu-operator/values.yaml -``` - -> [!NOTE] -> Replace `v1.7.0` in the above command with the version you want to use. - -## Local Image Registry - -Without internet access, the GPU Operator requires all images to be hosted in a local image registry that is accessible -to all nodes in the cluster. To allow the GPU Operator to work with a local registry, users can specify local -repository, image, tag along with pull secrets in `values.yaml`. - -To pull the correct images from the NVIDIA registry, you can leverage the fields `repository`, `image` and `version` -specified in the file `values.yaml`. - -The general syntax for the container image is `/:`. - -If the version is not specified, you can retrieve the information from the NVIDIA NGC catalog at https://catalog.ngc.nvidia.com/containers. -Search for an image, such as `gpu-operator` and then check the available tags for the image. - -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). - -An example is shown below with the Operator container image: - -```yaml -operator: - repository: nvcr.io/nvidia - image: gpu-operator - version: "" -``` - -For instance, to pull the gpu-operator image version , use the following instruction: - -```console -$ docker pull nvcr.io/nvidia/gpu-operator: -``` - -There is one caveat with regards to the driver image. The version field must be appended by the OS name running on the worker node. - -```yaml -driver: - repository: nvcr.io/nvidia - image: driver - version: "${recommended}" -``` - -To pull the driver image for Ubuntu 20.04: - -```console -$ docker pull nvcr.io/nvidia/driver:${recommended}-ubuntu20.04 -``` - -To push the images to the local registry, simply tag the pulled images by prefixing the image with the image registry information. 
- -Using the above examples, this will result in: - -```console -$ docker tag nvcr.io/nvidia/gpu-operator: //gpu-operator: -$ docker tag nvcr.io/nvidia/driver:${recommended}-ubuntu20.04 //driver:${recommended}-ubuntu20.04 -``` - -Finally, push the images to the local registry: - -```console -$ docker push //gpu-operator: -$ docker push //driver:${recommended}-ubuntu20.04 -``` - -Update `values.yaml` with local registry information in the repository field. - -> [!NOTE] -> Replace below with your local image registry URL and port. -> Sample of `values.yaml` for GPU Operator v1.9.0: - -```yaml -operator: - repository: - image: gpu-operator - version: 1.9.0 - imagePullSecrets: [] - initContainer: - image: cuda - repository: - version: 11.4.2-base-ubi8 - - validator: - image: gpu-operator-validator - repository: - version: 1.9.0 - imagePullSecrets: [] - - driver: - repository: - image: driver - version: "470.82.01" - imagePullSecrets: [] - manager: - image: k8s-driver-manager - repository: - version: v0.2.0 - - toolkit: - repository: - image: container-toolkit - version: 1.7.2-ubuntu18.04 - imagePullSecrets: [] - - devicePlugin: - repository: - image: k8s-device-plugin - version: v0.10.0-ubi8 - imagePullSecrets: [] - - dcgmExporter: - repository: - image: dcgm-exporter - version: 2.3.1-2.6.0-ubuntu20.04 - imagePullSecrets: [] - - gfd: - repository: - image: gpu-feature-discovery - version: v0.4.1 - imagePullSecrets: [] - - nodeStatusExporter: - enabled: false - repository: - image: gpu-operator-validator - version: "1.9.0" - - migManager: - enabled: true - repository: - image: k8s-mig-manager - version: v0.2.0-ubuntu20.04 - - node-feature-discovery: - image: - repository: - pullPolicy: IfNotPresent - # tag, if defined will use the given image tag, else Chart.AppVersion will be used - # tag: - imagePullSecrets: [] -``` - -## Local Package Repository - -The `driver` container deployed as part of the GPU Operator requires certain packages to be available as part of the 
-driver installation. In restricted internet access or air-gapped installations, users are required to create a -local mirror repository for their OS distribution and make the following packages available: - -> [!NOTE] -> KERNEL_VERSION is the underlying running kernel version on the GPU node -> GCC_VERSION is the gcc version matching the one used for building underlying kernel - -Configuring a local package repository is not necessary for clusters that -can run precompiled-drivers. -### Required Packages - -```yaml -ubuntu: - linux-headers-${KERNEL_VERSION} - linux-image-${KERNEL_VERSION} - linux-modules-${KERNEL_VERSION} - -centos: - elfutils-libelf.x86_64 - elfutils-libelf-devel.x86_64 - kernel-headers-${KERNEL_VERSION} - kernel-devel-${KERNEL_VERSION} - kernel-core-${KERNEL_VERSION} - gcc-${GCC_VERSION} - -rhel/rhcos: - kernel-headers-${KERNEL_VERSION} - kernel-devel-${KERNEL_VERSION} - kernel-core-${KERNEL_VERSION} - gcc-${GCC_VERSION} -``` - -For example, for Ubuntu, these packages can be found at `archive.ubuntu.com`. -This is the mirror to be replicate locally for your cluster. -You can use `apt-mirror` to mirror these packages to your local package repository server. - -For CentOS, `reposync` can be used to create the local mirror. - -After all the required packages are mirrored to the local repository, repo lists need to be created following -distribution specific documentation. A `ConfigMap` containing the repo list file needs to be created in -the namespace where the GPU Operator gets deployed. 
- -An example of repo list is shown below for Ubuntu 22.04 (access to local package repository via HTTP): - -`custom-repo.list`: - -```text -deb [arch=amd64] http:///ubuntu/mirror/archive.ubuntu.com/ubuntu jammy main universe -deb [arch=amd64] http:///ubuntu/mirror/archive.ubuntu.com/ubuntu jammy-updates main universe -deb [arch=amd64] http:///ubuntu/mirror/archive.ubuntu.com/ubuntu jammy-security main universe -``` - -An example of repo list is shown below for Ubuntu 20.04 (access to local package repository via HTTP): - -`custom-repo.list`: - -```text -deb [arch=amd64] http:///ubuntu/mirror/archive.ubuntu.com/ubuntu focal main universe -deb [arch=amd64] http:///ubuntu/mirror/archive.ubuntu.com/ubuntu focal-updates main universe -deb [arch=amd64] http:///ubuntu/mirror/archive.ubuntu.com/ubuntu focal-security main universe -``` - -An example of repo list is shown below for CentOS 8 (access to local package repository via HTTP): - -`custom-repo.repo`: - -```text -[baseos] -name=CentOS Linux $releasever - BaseOS -baseurl=http:///repos/centos/$releasever/$basearch/os/baseos/ -gpgcheck=0 -enabled=1 - -[appstream] -name=CentOS Linux $releasever - AppStream -baseurl=http:///repos/centos/$releasever/$basearch/os/appstream/ -gpgcheck=0 -enabled=1 - -[extras] -name=CentOS Linux $releasever - Extras -baseurl=http:///repos/centos/$releasever/$basearch/os/extras/ -gpgcheck=0 -enabled=1 -``` - -Create a `ConfigMap` object from the file: - -```console -$ kubectl create configmap repo-config -n gpu-operator --from-file= -``` - -Update the `custom-repo.list` file and config map as appropriate if the containerization software platform, such as Tanzu, upgrades the Kubernetes cluster nodes to a newer operating system version. - -After the config map is created, update `values.yaml` with this information to let the GPU Operator mount the repo configuration -within the `driver` container to pull required packages. 
Based on the OS distribution the GPU Operator automatically mounts this config map into the appropriate directory. - -```yaml -driver: - repoConfig: - configMapName: repo-config -``` - -If self-signed certificates are used for an HTTPS based internal repository then you must add a config map for those certificates. -You then specify the config map during the GPU Operator install. -Based on the OS distribution the GPU Operator automatically mounts this config map into the appropriate directory. -Similarly, the certificate file format and suffix, such as `.crt` or `.pem`, also depends on the OS distribution. - -```console -$ kubectl create configmap cert-config -n gpu-operator --from-file= --from-file= -``` - -```yaml -driver: - certConfig: - name: cert-config -``` - -## Deploy GPU Operator +## Activation -Download and deploy GPU Operator Helm Chart with the updated `values.yaml`. +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All command sequences, the use-case matrix, full `values.yaml` +samples, repo-list/cert ConfigMap manifests, and the deploy commands live only +in those reference files — do not improvise commands from this dispatch layer. -Fetch the chart from the NGC repository: +## Phases -```console -$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-.tgz -``` +| Phase | Summary | Reference | +|-------|---------|-----------| +| Concepts | Why the Operator needs internet, the four supported connectivity use cases (HTTP-proxy full/limited, full air-gapped w/ and w/o proxy), DNS requirements, the OpenShift-disconnected pointer, and fetching the base `values.yaml`. | [references/concepts.md](references/concepts.md) | +| Local image registry | Pull NVIDIA images, tag and push them to your local registry (note the driver-image OS-suffix caveat), and the full per-component `values.yaml` repository/version/imagePullSecrets sample. 
| [references/local-image-registry.md](references/local-image-registry.md) | +| Local package repository | Mirror the required OS packages (Ubuntu/CentOS/RHEL lists), build repo-list files, create the `repo-config` (and optional `cert-config`) ConfigMaps, and wire them via `driver.repoConfig` / `driver.certConfig`. Not needed with precompiled drivers. | [references/local-package-repository.md](references/local-package-repository.md) | +| Deploy | Fetch the chart `.tgz` and `helm install` with the customized `values.yaml`; confirm pods are running. | [references/deploy.md](references/deploy.md) | -Install the GPU Operator with the customized `values.yaml`: +## Hard rules (apply across all phases) -```console -$ helm install --wait gpu-operator \ - -n gpu-operator --create-namespace \ - gpu-operator-.tgz \ - -f values.yaml -``` +- In a full air-gapped cluster, every image must be hosted in a local registry reachable by all nodes. +- The driver image version must be suffixed with the node OS (e.g., `${recommended}-ubuntu20.04`). +- A local package repository is not required if the cluster uses precompiled drivers (see the `gpu-operator-precompiled-drivers` skill). +- For self-signed HTTPS repositories, supply a `cert-config` ConfigMap; cert format/suffix is OS-dependent. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. -Check the status of the pods to ensure all the containers are running: +## Verification -```console -$ kubectl get pods -n gpu-operator -``` +After deploying, run `kubectl get pods -n gpu-operator` and confirm all +containers are `Running`. Exact commands are in +[references/deploy.md](references/deploy.md). 
diff --git a/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/concepts.md b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/concepts.md new file mode 100644 index 000000000..5d2cf461f --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/concepts.md @@ -0,0 +1,62 @@ + + + +# About Air-Gapped Installations + +This page describes how to successfully deploy the GPU Operator in clusters with restricted internet access. +By default, the GPU Operator requires internet access for the following reasons: + + 1) Container images need to be pulled during GPU Operator installation. + 2) The `driver` container needs to download several OS packages prior to driver installation. + + > [!TIP] + > Using precompiled drivers removes the need for the `driver` containers to + > download operating system packages and removes the need to create a local package repository. + > To address these requirements, it may be necessary to create a local image registry and/or a local package repository + > so that the necessary images and packages are available for your cluster. In subsequent sections, we detail how to + > configure the GPU Operator to use local image registries and local package repositories. If your cluster is behind + > a proxy, also follow the steps from install-gpu-operator-proxy. + +Different steps are required for different environments with varying levels of internet connectivity.
+The supported use cases/environments are listed in the table below: + ++--------------------------+-----------------------------------------+ + Network Flow | ++--------------------------+--------------------+--------------------+ + Use Case Pulling Images Pulling Packages ++========+=================+====================+====================+ + **1** HTTP Proxy with K8s node --> HTTP Driver container | + full Internet Proxy --> Internet --> HTTP Proxy --> | + access Image Registry Internet Package | + Repository | ++--------+-----------------+--------------------+--------------------+ + **2** HTTP Proxy with K8s node --> HTTP Driver container | + limited Internet Proxy --> Internet --> HTTP Proxy --> | + access Image Registry Local Package | + Repository | ++--------+-----------------+--------------------+--------------------+ + **3a** Full Air-Gapped K8s node --> Local Driver container | + (w/ HTTP Proxy) Image Registry --> HTTP Proxy --> | + Local Package | + Repository | ++--------+-----------------+--------------------+--------------------+ + **3b** Full Air-Gapped K8s node --> Local Driver container-->| + (w/o HTTP Proxy) Image Registry Local Package | + Repository | ++--------+-----------------+--------------------+--------------------+ + +> [!NOTE] +> For Red Hat OpenShift deployments in air-gapped environments (use cases 2, 3a, and 3b), +> refer to [Mirror GPU Operator images for disconnected OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/mirror-gpu-ocp-disconnected.html). +> [!NOTE] +> Ensure that Kubernetes nodes can successfully reach the local DNS server(s). +> Public name resolution for the image registry and package repositories is +> mandatory for use cases 1 and 2. +> Before proceeding to the next sections, get the `values.yaml` file used for GPU Operator configuration.
+ +```console +$ curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator/v1.7.0/deployments/gpu-operator/values.yaml +``` + +> [!NOTE] +> Replace `v1.7.0` in the above command with the version you want to use. diff --git a/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/deploy.md b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/deploy.md new file mode 100644 index 000000000..2df8773c9 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/deploy.md @@ -0,0 +1,30 @@ + + + +# Deploy GPU Operator + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +Download and deploy GPU Operator Helm Chart with the updated `values.yaml`. + +Fetch the chart from the NGC repository: + +```console +$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-.tgz +``` + +Install the GPU Operator with the customized `values.yaml`: + +```console +$ helm install --wait gpu-operator \ + -n gpu-operator --create-namespace \ + gpu-operator-.tgz \ + -f values.yaml +``` + +Check the status of the pods to ensure all the containers are running: + +```console +$ kubectl get pods -n gpu-operator +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/local-image-registry.md b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/local-image-registry.md new file mode 100644 index 000000000..4c349401b --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/local-image-registry.md @@ -0,0 +1,143 @@ + + + +# Local Image Registry + +Without internet access, the GPU Operator requires all images to be hosted in a local image registry that is accessible +to all nodes in the cluster. 
To allow the GPU Operator to work with a local registry, you can specify the local +repository, image, and tag, along with pull secrets, in `values.yaml`. + +To pull the correct images from the NVIDIA registry, you can use the fields `repository`, `image`, and `version` +specified in the file `values.yaml`. + +The general syntax for the container image is `/:`. + +If the version is not specified, you can retrieve the information from the NVIDIA NGC catalog at https://catalog.ngc.nvidia.com/containers. +Search for an image, such as `gpu-operator`, and then check the available tags for the image. + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +An example is shown below with the Operator container image: + +```yaml +operator: + repository: nvcr.io/nvidia + image: gpu-operator + version: "" +``` + +For instance, to pull the gpu-operator image version , use the following instruction: + +```console +$ docker pull nvcr.io/nvidia/gpu-operator: +``` + +There is one caveat with regard to the driver image. When you pull the image, the name of the OS running on the worker node must be appended to the version field. + +```yaml +driver: + repository: nvcr.io/nvidia + image: driver + version: "${recommended}" +``` + +To pull the driver image for Ubuntu 20.04: + +```console +$ docker pull nvcr.io/nvidia/driver:${recommended}-ubuntu20.04 +``` + +To push the images to the local registry, tag the pulled images by prefixing the image name with the local image registry information.
+ +Using the above examples, this will result in: + +```console +$ docker tag nvcr.io/nvidia/gpu-operator: //gpu-operator: +$ docker tag nvcr.io/nvidia/driver:${recommended}-ubuntu20.04 //driver:${recommended}-ubuntu20.04 +``` + +Finally, push the images to the local registry: + +```console +$ docker push //gpu-operator: +$ docker push //driver:${recommended}-ubuntu20.04 +``` + +Update `values.yaml` with local registry information in the repository field. + +> [!NOTE] +> Replace below with your local image registry URL and port. +> Sample of `values.yaml` for GPU Operator v1.9.0: + +```yaml +operator: + repository: + image: gpu-operator + version: 1.9.0 + imagePullSecrets: [] + initContainer: + image: cuda + repository: + version: 11.4.2-base-ubi8 + + validator: + image: gpu-operator-validator + repository: + version: 1.9.0 + imagePullSecrets: [] + + driver: + repository: + image: driver + version: "470.82.01" + imagePullSecrets: [] + manager: + image: k8s-driver-manager + repository: + version: v0.2.0 + + toolkit: + repository: + image: container-toolkit + version: 1.7.2-ubuntu18.04 + imagePullSecrets: [] + + devicePlugin: + repository: + image: k8s-device-plugin + version: v0.10.0-ubi8 + imagePullSecrets: [] + + dcgmExporter: + repository: + image: dcgm-exporter + version: 2.3.1-2.6.0-ubuntu20.04 + imagePullSecrets: [] + + gfd: + repository: + image: gpu-feature-discovery + version: v0.4.1 + imagePullSecrets: [] + + nodeStatusExporter: + enabled: false + repository: + image: gpu-operator-validator + version: "1.9.0" + + migManager: + enabled: true + repository: + image: k8s-mig-manager + version: v0.2.0-ubuntu20.04 + + node-feature-discovery: + image: + repository: + pullPolicy: IfNotPresent + # tag, if defined will use the given image tag, else Chart.AppVersion will be used + # tag: + imagePullSecrets: [] +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/local-package-repository.md 
b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/local-package-repository.md new file mode 100644 index 000000000..a6f132688 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-airgapped-environments/references/local-package-repository.md @@ -0,0 +1,124 @@ + + + +# Local Package Repository + +The `driver` container deployed as part of the GPU Operator requires certain packages to be available as part of the +driver installation. In installations with restricted internet access or in air-gapped installations, you must create a +local mirror repository for your OS distribution and make the following packages available: + +> [!NOTE] +> `KERNEL_VERSION` is the kernel version running on the GPU node. +> `GCC_VERSION` is the gcc version that matches the one used to build that kernel. + +Configuring a local package repository is not necessary for clusters that +can run precompiled drivers. + +## Required Packages + +```yaml +ubuntu: + linux-headers-${KERNEL_VERSION} + linux-image-${KERNEL_VERSION} + linux-modules-${KERNEL_VERSION} + +centos: + elfutils-libelf.x86_64 + elfutils-libelf-devel.x86_64 + kernel-headers-${KERNEL_VERSION} + kernel-devel-${KERNEL_VERSION} + kernel-core-${KERNEL_VERSION} + gcc-${GCC_VERSION} + +rhel/rhcos: + kernel-headers-${KERNEL_VERSION} + kernel-devel-${KERNEL_VERSION} + kernel-core-${KERNEL_VERSION} + gcc-${GCC_VERSION} +``` + +For example, for Ubuntu, these packages can be found at `archive.ubuntu.com`. +This is the mirror to replicate locally for your cluster. +You can use `apt-mirror` to mirror these packages to your local package repository server. + +For CentOS, `reposync` can be used to create the local mirror. + +After all the required packages are mirrored to the local repository, repo lists need to be created following +distribution-specific documentation. A `ConfigMap` containing the repo list file needs to be created in +the namespace where the GPU Operator gets deployed.
+ +An example of repo list is shown below for Ubuntu 22.04 (access to local package repository via HTTP): + +`custom-repo.list`: + +```text +deb [arch=amd64] http:///ubuntu/mirror/archive.ubuntu.com/ubuntu jammy main universe +deb [arch=amd64] http:///ubuntu/mirror/archive.ubuntu.com/ubuntu jammy-updates main universe +deb [arch=amd64] http:///ubuntu/mirror/archive.ubuntu.com/ubuntu jammy-security main universe +``` + +An example of repo list is shown below for Ubuntu 20.04 (access to local package repository via HTTP): + +`custom-repo.list`: + +```text +deb [arch=amd64] http:///ubuntu/mirror/archive.ubuntu.com/ubuntu focal main universe +deb [arch=amd64] http:///ubuntu/mirror/archive.ubuntu.com/ubuntu focal-updates main universe +deb [arch=amd64] http:///ubuntu/mirror/archive.ubuntu.com/ubuntu focal-security main universe +``` + +An example of repo list is shown below for CentOS 8 (access to local package repository via HTTP): + +`custom-repo.repo`: + +```text +[baseos] +name=CentOS Linux $releasever - BaseOS +baseurl=http:///repos/centos/$releasever/$basearch/os/baseos/ +gpgcheck=0 +enabled=1 + +[appstream] +name=CentOS Linux $releasever - AppStream +baseurl=http:///repos/centos/$releasever/$basearch/os/appstream/ +gpgcheck=0 +enabled=1 + +[extras] +name=CentOS Linux $releasever - Extras +baseurl=http:///repos/centos/$releasever/$basearch/os/extras/ +gpgcheck=0 +enabled=1 +``` + +Create a `ConfigMap` object from the file: + +```console +$ kubectl create configmap repo-config -n gpu-operator --from-file= +``` + +Update the `custom-repo.list` file and config map as appropriate if the containerization software platform, such as Tanzu, upgrades the Kubernetes cluster nodes to a newer operating system version. + +After the config map is created, update `values.yaml` with this information to let the GPU Operator mount the repo configuration +within the `driver` container to pull required packages. 
Based on the OS distribution the GPU Operator automatically mounts this config map into the appropriate directory. + +```yaml +driver: + repoConfig: + configMapName: repo-config +``` + +If self-signed certificates are used for an HTTPS-based internal repository, then you must add a config map for those certificates. +You then specify the config map during the GPU Operator install. +Based on the OS distribution the GPU Operator automatically mounts this config map into the appropriate directory. +Similarly, the certificate file format and suffix, such as `.crt` or `.pem`, also depend on the OS distribution. + +```console +$ kubectl create configmap cert-config -n gpu-operator --from-file= --from-file= +``` + +```yaml +driver: + certConfig: + name: cert-config +``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md index 24501c427..1841f4a5e 100644 --- a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/SKILL.md @@ -20,6 +20,12 @@ tags: # Using NVIDIA vGPU +Install the GPU Operator on NVIDIA vGPU. Because the vGPU guest driver may not +be publicly redistributed, the flow is: download the licensed vGPU software, +build and push a private vGPU driver container, configure the cluster with the +NVIDIA License System token + image pull secret, then install the Operator +pointing at your private driver image. + ## Prerequisites Before installing the GPU Operator on NVIDIA vGPU, ensure the following: @@ -39,200 +45,32 @@ Before installing the GPU Operator on NVIDIA vGPU, ensure the following: > [!NOTE] > Uploading the NVIDIA vGPU driver to a publicly available repository or otherwise publicly sharing the driver is a violation of the NVIDIA vGPU EULA.
-## About Installing the Operator and NVIDIA vGPU - -NVIDIA Virtual GPU (vGPU) enables multiple virtual machines (VMs) to have simultaneous, -direct access to a single physical GPU, using the same NVIDIA graphics drivers that are deployed on non-virtualized operating systems. - -The installation steps assume `gpu-operator` as the default namespace for installing the NVIDIA GPU Operator. -In case of Red Hat OpenShift Container Platform, the default namespace is `nvidia-gpu-operator`. -Change the namespace shown in the commands accordingly based on your cluster configuration. -Also replace `kubectl` in the following commands with `oc` when running on Red Hat OpenShift. - -NVIDIA vGPU is only supported with the NVIDIA License System. - -## Platform Support - -For information about the supported platforms, refer to Supported Deployment Options, Hypervisors, and NVIDIA vGPU Based Products. - -For Red Hat OpenShift Virtualization, refer to NVIDIA GPU Operator with OpenShift Virtualization. - -## Download vGPU Software - -Perform the following steps to download the vGPU software and the latest NVIDIA vGPU driver catalog file from the NVIDIA Licensing Portal. - -1. Log in to the NVIDIA Enterprise Application Hub at https://nvid.nvidia.com/dashboard and then click **NVIDIA LICENSING PORTAL**. -1. In the left navigation pane of the NVIDIA Licensing Portal, click **SOFTWARE DOWNLOADS**. -1. Locate **vGPU Driver Catalog** in the table of driver downloads and click **Download**. -1. Click the **PRODUCT FAMILY** menu and select **vGPU** to filter the downloads to vGPU only. -1. Locate the vGPU software for your platform in the table of software downloads and click **Download**. - -The vGPU software is packaged as a ZIP file. -Unzip the file to obtain the NVIDIA vGPU Linux guest driver. -The guest driver file name follows the pattern `NVIDIA-Linux-x86_64--grid.run`. 
- -## Build the Driver Container - -Perform the following steps to build and push a container image that includes the vGPU Linux guest driver. - -1. Clone the driver container repository and change directory into the repository: - - ```console - $ git clone https://github.com/NVIDIA/gpu-driver-container.git - ``` - - ```console - $ cd gpu-driver-container - ``` - -1. Copy the NVIDIA vGPU guest driver from your extracted ZIP file and the NVIDIA vGPU driver catalog file to the operating system version you want to build the driver container for: - - Copy `/\*-grid.run` and `vgpuDriverCatalog.yaml` to `ubuntu22.04/drivers/`. - - ```console - $ cp /*-grid.run ubuntu22.04/drivers/ - ``` - - ```console - $ cp vgpuDriverCatalog.yaml ubuntu22.04/drivers/ - ``` - - For Red Hat OpenShift Container Platform, use a directory that includes `rhel` in the directory name. - -1. Set environment variables for building the driver container image. - - - Specify your private registry URL: - - ```console - $ export PRIVATE_REGISTRY= - ``` - - - Specify the `OS_TAG` environment variable to identify the guest operating system name and version: - - ```console - $ export OS_TAG=ubuntu22.04 - ``` - - The value must match the guest operating system version. - For Red Hat OpenShift Container Platform, specify `rhcos4.` where `x` is the supported minor OCP version. - Refer to Supported Operating Systems and Kubernetes Platforms for the list of supported OS distributions. - - - Specify the Linux guest vGPU driver version that you downloaded from the NVIDIA Licensing Portal: - - ```console - $ export VGPU_DRIVER_VERSION=580.95.05 - ``` +## Activation - The Operator automatically selects the compatible guest driver version from the drivers bundled with the `driver` image. - If you disable the version check by specifying `--build-arg DISABLE_VGPU_VERSION_CHECK=true` when you build the driver image, - then the `VGPU_DRIVER_VERSION` value is used as default. 
+Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All download steps, build commands, secret manifests, Helm `--set` +flags, and verification output live only in those reference files — do not +improvise commands from this dispatch layer. -1. Build the driver container image. +## Phases - > [!NOTE] - > Docker is the only supported container tool for building the driver container image. - > Multi-architecture builds additionally require [buildx](https://github.com/docker/buildx). +| Phase | Summary | Reference | +|-------|---------|-----------| +| Concepts | What NVIDIA vGPU is, the namespace/`oc`-vs-`kubectl` notes (incl. OpenShift defaults), the License-System-only requirement, and where to find platform support. | [references/concepts.md](references/concepts.md) | +| Download + build driver | Download the vGPU software and driver catalog from the Licensing Portal, then clone `gpu-driver-container`, stage the `*-grid.run` + catalog, set build env vars, `make build-vgpuguest-`, and push to your private registry. | [references/build-driver.md](references/build-driver.md) | +| Configure + install | Create `gridd.conf`, the `client_configuration_token.tok`, the `licensing-config` secret and the registry image-pull secret, then `helm install` pointing `driver.repository`/`driver.version`/`driver.imagePullSecrets`/`driver.licensingConfig.secretName` at your private image; verify. | [references/configure-and-install.md](references/configure-and-install.md) | - ```console - $ VGPU_GUEST_DRIVER_VERSION=${VGPU_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/driver make build-vgpuguest-${OS_TAG} - ``` +## Hard rules (apply across all phases) -1. Push the driver container image to your private registry. - - 1. Log in to your private registry: - - ```console - $ sudo docker login ${PRIVATE_REGISTRY} --username= - ``` - - Enter your password when prompted. - - 1. 
Push the driver container image to your private registry: - - ```console - $ VGPU_GUEST_DRIVER_VERSION=${VGPU_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/driver make push-vgpuguest-${OS_TAG} - ``` - -## Configure the Cluster with the vGPU License Information and the Driver Container Image - -1. Create an NVIDIA vGPU license file named `gridd.conf` with contents like the following example: - - ```text - # Description: Set Feature to be enabled - # Data type: integer - # Possible values: - # 0 => for unlicensed state - # 1 => for NVIDIA vGPU - # 2 => for NVIDIA RTX Virtual Workstation - # 4 => for NVIDIA Virtual Compute Server - FeatureType=1 - ``` - -1. Rename the client configuration token file that you downloaded to `client_configuration_token.tok` using a command like the following example: - - ```console - $ cp ~/Downloads/client_configuration_token_03-28-2023-16-16-36.tok client_configuration_token.tok - ``` - - The file must be named `client_configuration_token.tok`. - -1. Create the `gpu-operator` namespace: - - ```console - $ kubectl create namespace gpu-operator - ``` - -1. Create a secret that is named `licensing-config` using the `gridd.conf` and `client_configuration_token.tok` files: - - ```console - $ kubectl create secret generic licensing-config \ - -n gpu-operator --from-file=gridd.conf --from-file=client_configuration_token.tok - ``` - -1. Create an image pull secret in the `gpu-operator` namespace with the registry secret and private registry. - - 1. Set an environment variable with the name of the secret: - - ```console - $ export REGISTRY_SECRET_NAME=registry-secret - ``` - - 1. Create the secret: - - ```console - $ kubectl create secret docker-registry ${REGISTRY_SECRET_NAME} \ - --docker-server=${PRIVATE_REGISTRY} --docker-username= \ - --docker-password= \ - --docker-email= -n gpu-operator - ``` - - You need to specify the secret name `REGISTRY_SECRET_NAME` when you install the GPU Operator with Helm. 
- -## Install the Operator - -- Install the Operator: - - ```console - $ helm install --wait --generate-name \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --set driver.repository=${PRIVATE_REGISTRY} \ - --set driver.version=${VGPU_DRIVER_VERSION} \ - --set driver.imagePullSecrets={$REGISTRY_SECRET_NAME} \ - --set driver.licensingConfig.secretName=licensing-config - ``` - -The preceding command installs the Operator with the default configuration. -Refer to the [GPU Operator Helm chart options](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options) for information about configuration options. +- NVIDIA vGPU is only supported with the NVIDIA License System (CLS or DLS). +- Publicly sharing or uploading the NVIDIA vGPU driver violates the NVIDIA vGPU EULA; build into a private registry only. +- Docker is the only supported tool for building the driver container image (multi-arch additionally needs buildx). +- The client configuration token file must be named exactly `client_configuration_token.tok`. +- On OpenShift, the default namespace is `nvidia-gpu-operator` and `kubectl` becomes `oc`. ## Verification -Confirm that the Operator installed and the vGPU driver pods are running: - -1. Confirm the Operator pods, including the vGPU driver, are running: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - The `nvidia-vgpu-driver-daemonset` pods should report `Running` and the `nvidia-operator-validator` pod should report `Completed`. For general post-install validation, use the `gpu-operator-install` skill's verification steps. +Confirm the `nvidia-vgpu-driver-daemonset` pods are `Running` and +`nvidia-operator-validator` is `Completed`. Exact commands are in +[references/configure-and-install.md](references/configure-and-install.md). 
diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/references/build-driver.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/references/build-driver.md new file mode 100644 index 000000000..98c34f7e7 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/references/build-driver.md @@ -0,0 +1,100 @@ + + + +# Download vGPU Software and Build the Driver Container + +## Download vGPU Software + +Perform the following steps to download the vGPU software and the latest NVIDIA vGPU driver catalog file from the NVIDIA Licensing Portal. + +1. Log in to the NVIDIA Enterprise Application Hub at https://nvid.nvidia.com/dashboard and then click **NVIDIA LICENSING PORTAL**. +1. In the left navigation pane of the NVIDIA Licensing Portal, click **SOFTWARE DOWNLOADS**. +1. Locate **vGPU Driver Catalog** in the table of driver downloads and click **Download**. +1. Click the **PRODUCT FAMILY** menu and select **vGPU** to filter the downloads to vGPU only. +1. Locate the vGPU software for your platform in the table of software downloads and click **Download**. + +The vGPU software is packaged as a ZIP file. +Unzip the file to obtain the NVIDIA vGPU Linux guest driver. +The guest driver file name follows the pattern `NVIDIA-Linux-x86_64--grid.run`. + +## Build the Driver Container + +Perform the following steps to build and push a container image that includes the vGPU Linux guest driver. + +1. Clone the driver container repository and change directory into the repository: + + ```console + $ git clone https://github.com/NVIDIA/gpu-driver-container.git + ``` + + ```console + $ cd gpu-driver-container + ``` + +1. Copy the NVIDIA vGPU guest driver from your extracted ZIP file and the NVIDIA vGPU driver catalog file to the operating system version you want to build the driver container for: + + Copy `/\*-grid.run` and `vgpuDriverCatalog.yaml` to `ubuntu22.04/drivers/`. 
+ + ```console + $ cp /*-grid.run ubuntu22.04/drivers/ + ``` + + ```console + $ cp vgpuDriverCatalog.yaml ubuntu22.04/drivers/ + ``` + + For Red Hat OpenShift Container Platform, use a directory that includes `rhel` in the directory name. + +1. Set environment variables for building the driver container image. + + - Specify your private registry URL: + + ```console + $ export PRIVATE_REGISTRY= + ``` + + - Specify the `OS_TAG` environment variable to identify the guest operating system name and version: + + ```console + $ export OS_TAG=ubuntu22.04 + ``` + + The value must match the guest operating system version. + For Red Hat OpenShift Container Platform, specify `rhcos4.` where `x` is the supported minor OCP version. + Refer to Supported Operating Systems and Kubernetes Platforms for the list of supported OS distributions. + + - Specify the Linux guest vGPU driver version that you downloaded from the NVIDIA Licensing Portal: + + ```console + $ export VGPU_DRIVER_VERSION=580.95.05 + ``` + + The Operator automatically selects the compatible guest driver version from the drivers bundled with the `driver` image. + If you disable the version check by specifying `--build-arg DISABLE_VGPU_VERSION_CHECK=true` when you build the driver image, + then the `VGPU_DRIVER_VERSION` value is used as default. + +1. Build the driver container image. + + > [!NOTE] + > Docker is the only supported container tool for building the driver container image. + > Multi-architecture builds additionally require [buildx](https://github.com/docker/buildx). + + ```console + $ VGPU_GUEST_DRIVER_VERSION=${VGPU_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/driver make build-vgpuguest-${OS_TAG} + ``` + +1. Push the driver container image to your private registry. + + 1. Log in to your private registry: + + ```console + $ sudo docker login ${PRIVATE_REGISTRY} --username= + ``` + + Enter your password when prompted. + + 1. 
Push the driver container image to your private registry: + + ```console + $ VGPU_GUEST_DRIVER_VERSION=${VGPU_DRIVER_VERSION} IMAGE_NAME=${PRIVATE_REGISTRY}/driver make push-vgpuguest-${OS_TAG} + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/references/concepts.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/references/concepts.md new file mode 100644 index 000000000..97e674576 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/references/concepts.md @@ -0,0 +1,20 @@ + + + +# About Installing the Operator and NVIDIA vGPU + +NVIDIA Virtual GPU (vGPU) enables multiple virtual machines (VMs) to have simultaneous, +direct access to a single physical GPU, using the same NVIDIA graphics drivers that are deployed on non-virtualized operating systems. + +The installation steps assume `gpu-operator` as the default namespace for installing the NVIDIA GPU Operator. +In case of Red Hat OpenShift Container Platform, the default namespace is `nvidia-gpu-operator`. +Change the namespace shown in the commands accordingly based on your cluster configuration. +Also replace `kubectl` in the following commands with `oc` when running on Red Hat OpenShift. + +NVIDIA vGPU is only supported with the NVIDIA License System. + +## Platform Support + +For information about the supported platforms, refer to Supported Deployment Options, Hypervisors, and NVIDIA vGPU Based Products. + +For Red Hat OpenShift Virtualization, refer to NVIDIA GPU Operator with OpenShift Virtualization. 
diff --git a/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/references/configure-and-install.md b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/references/configure-and-install.md new file mode 100644 index 000000000..cafc7d8e1 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-install-nvidia-vgpu/references/configure-and-install.md @@ -0,0 +1,88 @@ + + + +# Configure the Cluster and Install the Operator + +## Configure the Cluster with the vGPU License Information and the Driver Container Image + +1. Create an NVIDIA vGPU license file named `gridd.conf` with contents like the following example: + + ```text + # Description: Set Feature to be enabled + # Data type: integer + # Possible values: + # 0 => for unlicensed state + # 1 => for NVIDIA vGPU + # 2 => for NVIDIA RTX Virtual Workstation + # 4 => for NVIDIA Virtual Compute Server + FeatureType=1 + ``` + +1. Rename the client configuration token file that you downloaded to `client_configuration_token.tok` using a command like the following example: + + ```console + $ cp ~/Downloads/client_configuration_token_03-28-2023-16-16-36.tok client_configuration_token.tok + ``` + + The file must be named `client_configuration_token.tok`. + +1. Create the `gpu-operator` namespace: + + ```console + $ kubectl create namespace gpu-operator + ``` + +1. Create a secret that is named `licensing-config` using the `gridd.conf` and `client_configuration_token.tok` files: + + ```console + $ kubectl create secret generic licensing-config \ + -n gpu-operator --from-file=gridd.conf --from-file=client_configuration_token.tok + ``` + +1. Create an image pull secret in the `gpu-operator` namespace with the registry secret and private registry. + + 1. Set an environment variable with the name of the secret: + + ```console + $ export REGISTRY_SECRET_NAME=registry-secret + ``` + + 1. 
Create the secret: + + ```console + $ kubectl create secret docker-registry ${REGISTRY_SECRET_NAME} \ + --docker-server=${PRIVATE_REGISTRY} --docker-username= \ + --docker-password= \ + --docker-email= -n gpu-operator + ``` + + You need to specify the secret name `REGISTRY_SECRET_NAME` when you install the GPU Operator with Helm. + +## Install the Operator + +- Install the Operator: + + ```console + $ helm install --wait --generate-name \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --set driver.repository=${PRIVATE_REGISTRY} \ + --set driver.version=${VGPU_DRIVER_VERSION} \ + --set driver.imagePullSecrets={$REGISTRY_SECRET_NAME} \ + --set driver.licensingConfig.secretName=licensing-config + ``` + +The preceding command installs the Operator with the default configuration. +Refer to the [GPU Operator Helm chart options](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options) for information about configuration options. + +## Verification + +Confirm that the Operator installed and the vGPU driver pods are running: + +1. Confirm the Operator pods, including the vGPU driver, are running: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + The `nvidia-vgpu-driver-daemonset` pods should report `Running` and the `nvidia-operator-validator` pod should report `Completed`. For general post-install validation, use the `gpu-operator-install` skill's verification steps. diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md index 5eb640ae9..881199fa9 100644 --- a/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/SKILL.md @@ -21,14 +21,11 @@ tags: # NVIDIA DRA Driver for GPUs -Dynamic Resource Allocation (DRA) is a Kubernetes concept for flexibly requesting, configuring, and sharing specialized devices like GPUs. 
-DRA puts device configuration and scheduling into the hands of device vendors through drivers such as the DRA Driver for GPUs. -This page outlines how to install the NVIDIA DRA Driver for GPUs v25.12.0 and later with the NVIDIA GPU Operator. - -Before using the DRA Driver for GPUs, it is recommended that you are familiar with the following concepts: - -* [Upstream Kubernetes DRA documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/). -* [DRA Driver repository documentation](https://github.com/NVIDIA/k8s-dra-driver-gpu) +Install and use the NVIDIA DRA Driver for GPUs (v25.12.0+) with the GPU +Operator. Dynamic Resource Allocation (DRA) lets workloads flexibly request, +configure, and share GPUs. The driver provides two independently-usable +resources: **GPU allocation** (a replacement for the Device Plugin's allocation) +and **ComputeDomains** (Multi-Node NVLink for GB200-class systems). ## Prerequisites @@ -41,288 +38,34 @@ For GPU allocation with the GPU Operator: - GPU Operator v25.10.0 or later with the NVIDIA Kubernetes Device Plugin disabled to avoid conflicts with the DRA Driver for GPUs. The DRA Driver requires Container Device Interface (CDI) enabled in the container runtime and NVIDIA Driver version 580 or later, both of which are default in GPU Operator v25.10.0 and later. - Label the nodes you plan to use for GPU allocation (for example, `nvidia.com/dra-kubelet-plugin=true`) and use them as node selectors in the DRA driver Helm chart. -## Overview - -With NVIDIA's DRA Driver for GPUs, your Kubernetes workload can allocate and consume the following two types of resources: - -* GPU allocation: for controlled sharing and dynamic reconfiguration of GPUs. This functionality is a replacement for the traditional GPU allocation method used by the NVIDIA Kubernetes Device Plugin. 
-* ComputeDomains: An abstraction for robust and secure [Multi-Node NVLink (MNNVL)](https://docs.nvidia.com/multi-node-nvlink-systems/index.html) for NVIDIA GB200 and similar systems. - -You can use the NVIDIA DRA Driver for GPUs with the NVIDIA GPU Operator to deploy and manage your GPUs and ComputeDomains. - -### Known Issues - -* There is a known issue where the NVIDIA Driver Manager is not aware of the DRA driver kubelet plugin, and will not correctly evict it on pod restarts. - You must label the nodes you plan to use with DRA GPU allocation and pass the node label in the GPU Operator Helm command in the `driver.manager.env` flag. - This enables the NVIDIA Driver Manager to evict the GPU kubelet plugin correctly on driver container upgrades. -* For A100 GPUs, the MIG manager does not automatically evict the DRA kubelet plugin during MIG configuration changes. - If the DRA kubelet plugin is deployed before a MIG change, then you must manually restart the DRA kubelet plugin. - -## Install the NVIDIA GPU Operator - -### GPU Allocation - -1. Create a node selector label on all the nodes in your cluster that support GPU allocation through DRA: - - ```console - kubectl label node $HOSTNAME nvidia.com/dra-kubelet-plugin=true - ``` - -2. Add the Helm repo: - - ```console - helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ - && helm repo update - ``` - -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). - -3. 
Install the GPU Operator with the NVIDIA Kubernetes Device Plugin disabled: - - ```console - helm upgrade --install gpu-operator nvidia/gpu-operator \ - --version= \ - --create-namespace \ - --namespace gpu-operator \ - --set devicePlugin.enabled=false \ - --set driver.manager.env[0].name=NODE_LABEL_FOR_GPU_POD_EVICTION \ - --set driver.manager.env[0].value="nvidia.com/dra-kubelet-plugin" - ``` - - Make sure that the value of `driver.manager.env` matches the node selector label that was used when installing the DRA driver helm chart. -### ComputeDomain - -1. Add the Helm repo: - -```console -helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ -&& helm repo update -``` - -2. Install the GPU Operator with the device plugin disabled: - -```console -helm upgrade --install gpu-operator nvidia/gpu-operator \ - --version= \ - --create-namespace \ - --namespace gpu-operator -``` - -Refer to the [GPU Operator installation guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-install.html) for additional configuration options when installing the GPU Operator. - -If you are planning to use MIG devices, refer to the [NVIDIA GPU Operator MIG documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html) to configure your cluster for MIG support. - -## Install DRA Driver for GPUs - -> [!NOTE] -> The `gpuResourcesEnabledOverride=true` is an additional flag that is required to fully enable GPU allocation support. -> Include it in the Helm command if you want to enable GPU allocation support. - -If you want to disable either functionality: - -* To disable GPU allocation support, include `--set resources.gpus.enabled=false` in the Helm command. -* To disable ComputeDomain support, include `--set resources.computeDomains.enabled=false` in the Helm command. -> [!NOTE] -> The `nvidiaDriverRoot` flag sets the root directory for the NVIDIA GPU driver. 
-> The default value is `/`, which is the typical value for drivers installed directly on the host. -> If you are using GPU Operator managed drivers (default), the drivers are installed to `/run/nvidia/driver` by default. -> If you are using [pre-installed drivers](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#pre-installed-nvidia-gpu-drivers), you can remove the `nvidiaDriverRoot` flag or set it to `/` in the command above. -### GPU Allocation - -1. Create a custom `values.yaml` file for installing the DRA driver helm chart. - - ### values.yaml file - - Specifies the node selector label for nodes that will support GPU allocation through the DRA Driver. - - ```yaml - image: - pullPolicy: IfNotPresent - kubeletPlugin: - nodeSelector: - nvidia.com/dra-kubelet-plugin: "true" - ``` - - ### GKE values.yaml file - - Google Kubernetes Engine requires some specific values to be set in the `values.yaml` file, including the driver root on the host in `nvidiaDriverRoot` as well as the node selector label for nodes that will support GPU allocation through the DRA Driver. - - ```yaml - # Specify the driver root on the host in nvidiaDriverRoot. - # "/home/kubernetes/bin/nvidia" is the default driver root on GKE. - nvidiaDriverRoot: "/home/kubernetes/bin/nvidia" - - controller: - priorityClassName: "" - affinity: null - image: - pullPolicy: IfNotPresent - kubeletPlugin: - priorityClassName: "" - tolerations: - - effect: NoSchedule - key: nvidia.com/gpu - operator: Exists - nodeSelector: - nvidia.com/dra-kubelet-plugin: "true" - ``` - -2. Add the Helm repo: - - ```console - helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ - && helm repo update - ``` - -3. 
Install the DRA driver: - - ### install command - - ```console - helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \ - --version="${dra_version}" \ - --namespace nvidia-dra-driver-gpu \ - --create-namespace \ - --set nvidiaDriverRoot=/run/nvidia/driver \ - --set gpuResourcesEnabledOverride=true \ - -f values.yaml - ``` - - ### GKE install command - - ```console - helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \ - --version="${dra_version}" \ - --namespace nvidia-dra-driver-gpu \ - --create-namespace \ - --set gpuResourcesEnabledOverride=true \ - -f values.yaml - ``` - -### ComputeDomain - -1. Add the NVIDIA NGC Catalog's Helm chart repository: - - ```console - helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update - ``` - -2. Install the DRA driver. - - Example for Operator-provided GPU driver: - - ```console - helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \ - --version="${dra_version}" \ - --create-namespace \ - --namespace nvidia-dra-driver-gpu \ - --set resources.gpus.enabled=false \ - --set nvidiaDriverRoot=/run/nvidia/driver - ``` - - Example for host-provided GPU driver: - - ```console - helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \ - --version="${dra_version}" \ - --create-namespace \ - --namespace nvidia-dra-driver-gpu \ - --set resources.gpus.enabled=false - ``` - -## Validate Installation - -1. Confirm that the DRA driver components are running: - - ```console - kubectl get pods -n nvidia-dra-driver-gpu - ``` - - *Example Output* - - ```output - NAME READY STATUS RESTARTS AGE - nvidia-dra-driver-gpu-controller-67cb99d84b-5q7kj 1/1 Running 0 7m26s - nvidia-dra-driver-gpu-kubelet-plugin-h5xsn 1/1 Running 0 7m27s - ``` - -2. 
Verify that GPU DeviceClasses are available: - - ```console - kubectl get deviceclass - ``` - - *Example Output* - - ```output - NAME AGE - compute-domain-daemon.nvidia.com 55s - compute-domain-default-channel.nvidia.com 55s - gpu.nvidia.com 55s - mig.nvidia.com 55s - ``` - -The `compute-domain-daemon.nvidia.com` and `compute-domain-default-channel.nvidia.com` DeviceClasses are installed when ComputeDomain support is enabled. -The `gpu.nvidia.com` and `mig.nvidia.com` DeviceClasses are installed when GPU allocation support is enabled. - -Additional validation steps are available in the DRA Driver repository documentation: - -* [Validate setup for ComputeDomain allocation](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Validate-setup-for-ComputeDomain-allocation) -* [Validate setup for GPU allocation](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Validate-setup-for-GPU-allocation) - -## Enable Health Checks - -The NVIDIA DRA driver supports GPU health monitoring using the [NVIDIA Management Library (NVML)](https://developer.nvidia.com/management-library-nvml). -This feature uses NVML to check for [GPU XID errors](https://docs.nvidia.com/deploy/xid-errors/introduction.html) and determines if a GPU or MIG device is functioning properly. - -Health checking is managed by the `NVMLDeviceHealthCheck` feature gate. -This is currently an alpha feature and is disabled by default. - -When enabled, the DRA Driver for GPUs continuously monitors GPUs for XID errors and assigns health statuses: -* Healthy - GPU is functioning normally. The GPU may have a non-critical XID error but is still available for workloads. -* Unhealthy - GPU has a critical XID error and is not suitable for workloads. 
- -To enable GPU health monitoring, deploy the DRA driver with the NVMLDeviceHealthCheck feature gate: - -```console -helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update -helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \ - --namespace nvidia-dra-driver-gpu \ - --set gpuResourcesEnabledOverride=true \ - --set featureGates.NVMLDeviceHealthCheck=true -``` - -> [!NOTE] -> Unhealthy GPUs will not appear in the ResourceSlice list. After the device recovers and is marked healthy again, you must restart the DRA Driver for the device to be added back into the available resources pool. -> After enabling health checks, you can monitor health status in the kubelet logs. - -1. Check kubelet plugin logs. - Health status changes are logged in the kubelet plugin container. Run `kubectl get pods -n nvidia-dra-driver-gpu` and find the `nvidia-dra-driver-gpu-kubelet-plugin-` pod name. Replace `` with your actual pod name. - - ```console - kubectl logs nvidia-dra-driver-gpu-kubelet-plugin- \ - -n nvidia-dra-driver-gpu \ - -c gpus - ``` +## Activation -2. List all ResourceSlices. - View all ResourceSlices in the cluster to see which devices are available: +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All `kubectl`/`helm` command sequences, `values.yaml` content +(including GKE-specific values), and validation output live only in those +reference files — do not improvise commands from this dispatch layer. - ```console - kubectl get resourceslice - ``` +## Phases -3. Inspect a specific ResourceSlice. - View detailed information about a specific resource slice. 
Healthy devices are listed in the resource slice, while unhealthy devices are not listed: +| Phase | Summary | Reference | +|-------|---------|-----------| +| Concepts | What DRA is, the GPU-allocation vs ComputeDomain resource types, and the known issues (Driver-Manager eviction label requirement; A100/MIG manual restart). | [references/concepts.md](references/concepts.md) | +| Install the GPU Operator | Label DRA nodes, add the Helm repo, and install the Operator with the Device Plugin disabled — the GPU-allocation path adds the `driver.manager.env` eviction-label flags; the ComputeDomain path does not. | [references/install-gpu-operator.md](references/install-gpu-operator.md) | +| Install the DRA driver + validate | Create the DRA-driver `values.yaml` (standard and GKE variants), install `nvidia-dra-driver-gpu` for GPU-allocation and/or ComputeDomain (Operator- vs host-provided driver root), and validate pods + DeviceClasses. | [references/install-dra-driver.md](references/install-dra-driver.md) | +| Enable health checks | Turn on the alpha `NVMLDeviceHealthCheck` feature gate for XID-based GPU health monitoring, and inspect health via kubelet logs and ResourceSlices. | [references/health-checks.md](references/health-checks.md) | - ```console - kubectl get resourceslice -o yaml - ``` +## Hard rules (apply across all phases) -## Additional Documentation +- Disable the NVIDIA Kubernetes Device Plugin (`devicePlugin.enabled=false`) to avoid conflicts with the DRA driver's GPU allocation. +- For GPU allocation, pass the node eviction label through `driver.manager.env` and ensure it matches the DRA driver chart's `kubeletPlugin.nodeSelector` label. +- `gpuResourcesEnabledOverride=true` is required to fully enable GPU-allocation support; disable a feature with `resources.gpus.enabled=false` or `resources.computeDomains.enabled=false`. 
+- Set `nvidiaDriverRoot=/run/nvidia/driver` for Operator-managed drivers (GKE: `/home/kubernetes/bin/nvidia`); for host/pre-installed drivers set `/` or omit. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. -Refer to the [DRA Driver for GPUs repository](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki) for additional documentation, including +## Verification -* [Upgrade Guide](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Installation#upgrading) -* [Troubleshooting Guide](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Troubleshooting) +Confirm the `nvidia-dra-driver-gpu` controller and kubelet-plugin pods are +`Running` and the expected DeviceClasses (`gpu.nvidia.com`, `mig.nvidia.com`, +and/or the `compute-domain-*` classes) exist. Exact commands are in +[references/install-dra-driver.md](references/install-dra-driver.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/concepts.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/concepts.md new file mode 100644 index 000000000..ae9c32d3b --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/concepts.md @@ -0,0 +1,37 @@ + + + +# Overview and Known Issues + +Dynamic Resource Allocation (DRA) is a Kubernetes concept for flexibly requesting, configuring, and sharing specialized devices like GPUs. +DRA puts device configuration and scheduling into the hands of device vendors through drivers such as the DRA Driver for GPUs. +This documentation outlines how to install the NVIDIA DRA Driver for GPUs v25.12.0 and later with the NVIDIA GPU Operator. + +Before using the DRA Driver for GPUs, it is recommended that you are familiar with the following concepts: + +* [Upstream Kubernetes DRA documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/). 
+* [DRA Driver repository documentation](https://github.com/NVIDIA/k8s-dra-driver-gpu) + +## Overview + +With NVIDIA's DRA Driver for GPUs, your Kubernetes workload can allocate and consume the following two types of resources: + +* GPU allocation: for controlled sharing and dynamic reconfiguration of GPUs. This functionality is a replacement for the traditional GPU allocation method used by the NVIDIA Kubernetes Device Plugin. +* ComputeDomains: An abstraction for robust and secure [Multi-Node NVLink (MNNVL)](https://docs.nvidia.com/multi-node-nvlink-systems/index.html) for NVIDIA GB200 and similar systems. + +You can use the NVIDIA DRA Driver for GPUs with the NVIDIA GPU Operator to deploy and manage your GPUs and ComputeDomains. + +## Known Issues + +* There is a known issue where the NVIDIA Driver Manager is not aware of the DRA driver kubelet plugin, and will not correctly evict it on pod restarts. + You must label the nodes you plan to use with DRA GPU allocation and pass the node label in the GPU Operator Helm command in the `driver.manager.env` flag. + This enables the NVIDIA Driver Manager to evict the GPU kubelet plugin correctly on driver container upgrades. +* For A100 GPUs, the MIG manager does not automatically evict the DRA kubelet plugin during MIG configuration changes. + If the DRA kubelet plugin is deployed before a MIG change, then you must manually restart the DRA kubelet plugin. 
+ +## Additional Documentation + +Refer to the [DRA Driver for GPUs repository](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki) for additional documentation, including + +* [Upgrade Guide](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Installation#upgrading) +* [Troubleshooting Guide](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Troubleshooting) diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/health-checks.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/health-checks.md new file mode 100644 index 000000000..4a5873511 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/health-checks.md @@ -0,0 +1,51 @@ + + + +# Enable Health Checks + +The NVIDIA DRA driver supports GPU health monitoring using the [NVIDIA Management Library (NVML)](https://developer.nvidia.com/management-library-nvml). +This feature uses NVML to check for [GPU XID errors](https://docs.nvidia.com/deploy/xid-errors/introduction.html) and determines if a GPU or MIG device is functioning properly. + +Health checking is managed by the `NVMLDeviceHealthCheck` feature gate. +This is currently an alpha feature and is disabled by default. + +When enabled, the DRA Driver for GPUs continuously monitors GPUs for XID errors and assigns health statuses: +* Healthy - GPU is functioning normally. The GPU may have a non-critical XID error but is still available for workloads. +* Unhealthy - GPU has a critical XID error and is not suitable for workloads. + +To enable GPU health monitoring, deploy the DRA driver with the NVMLDeviceHealthCheck feature gate: + +```console +helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update +helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \ + --namespace nvidia-dra-driver-gpu \ + --set gpuResourcesEnabledOverride=true \ + --set featureGates.NVMLDeviceHealthCheck=true +``` + +> [!NOTE] +> Unhealthy GPUs will not appear in the ResourceSlice list. 
After the device recovers and is marked healthy again, you must restart the DRA Driver for the device to be added back into the available resources pool. +> After enabling health checks, you can monitor health status in the kubelet logs. + +1. Check kubelet plugin logs. + Health status changes are logged in the kubelet plugin container. Run `kubectl get pods -n nvidia-dra-driver-gpu` and find the `nvidia-dra-driver-gpu-kubelet-plugin-` pod name. Replace `` with your actual pod name. + + ```console + kubectl logs nvidia-dra-driver-gpu-kubelet-plugin- \ + -n nvidia-dra-driver-gpu \ + -c gpus + ``` + +2. List all ResourceSlices. + View all ResourceSlices in the cluster to see which devices are available: + + ```console + kubectl get resourceslice + ``` + +3. Inspect a specific ResourceSlice. + View detailed information about a specific resource slice. Healthy devices are listed in the resource slice, while unhealthy devices are not listed: + + ```console + kubectl get resourceslice -o yaml + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/install-dra-driver.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/install-dra-driver.md new file mode 100644 index 000000000..a8447b906 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/install-dra-driver.md @@ -0,0 +1,162 @@ + + + +# Install DRA Driver for GPUs and Validate + +> [!NOTE] +> The `gpuResourcesEnabledOverride=true` is an additional flag that is required to fully enable GPU allocation support. +> Include it in the Helm command if you want to enable GPU allocation support. + +If you want to disable either functionality: + +* To disable GPU allocation support, include `--set resources.gpus.enabled=false` in the Helm command. +* To disable ComputeDomain support, include `--set resources.computeDomains.enabled=false` in the Helm command. + +> [!NOTE] +> The `nvidiaDriverRoot` flag sets the root directory for the NVIDIA GPU driver. 
+> The default value is `/`, which is the typical value for drivers installed directly on the host. +> If you are using GPU Operator managed drivers (default), the drivers are installed to `/run/nvidia/driver` by default. +> If you are using [pre-installed drivers](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#pre-installed-nvidia-gpu-drivers), you can remove the `nvidiaDriverRoot` flag or set it to `/` in the command above. + +## GPU Allocation + +1. Create a custom `values.yaml` file for installing the DRA driver helm chart. + + ### values.yaml file + + Specifies the node selector label for nodes that will support GPU allocation through the DRA Driver. + + ```yaml + image: + pullPolicy: IfNotPresent + kubeletPlugin: + nodeSelector: + nvidia.com/dra-kubelet-plugin: "true" + ``` + + ### GKE values.yaml file + + Google Kubernetes Engine requires some specific values to be set in the `values.yaml` file, including the driver root on the host in `nvidiaDriverRoot` as well as the node selector label for nodes that will support GPU allocation through the DRA Driver. + + ```yaml + # Specify the driver root on the host in nvidiaDriverRoot. + # "/home/kubernetes/bin/nvidia" is the default driver root on GKE. + nvidiaDriverRoot: "/home/kubernetes/bin/nvidia" + + controller: + priorityClassName: "" + affinity: null + image: + pullPolicy: IfNotPresent + kubeletPlugin: + priorityClassName: "" + tolerations: + - effect: NoSchedule + key: nvidia.com/gpu + operator: Exists + nodeSelector: + nvidia.com/dra-kubelet-plugin: "true" + ``` + +2. Add the Helm repo: + + ```console + helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ + && helm repo update + ``` + +3. 
Install the DRA driver: + + ### install command + + ```console + helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \ + --version="${dra_version}" \ + --namespace nvidia-dra-driver-gpu \ + --create-namespace \ + --set nvidiaDriverRoot=/run/nvidia/driver \ + --set gpuResourcesEnabledOverride=true \ + -f values.yaml + ``` + + ### GKE install command + + ```console + helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \ + --version="${dra_version}" \ + --namespace nvidia-dra-driver-gpu \ + --create-namespace \ + --set gpuResourcesEnabledOverride=true \ + -f values.yaml + ``` + +## ComputeDomain + +1. Add the NVIDIA NGC Catalog's Helm chart repository: + + ```console + helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update + ``` + +2. Install the DRA driver. + + Example for Operator-provided GPU driver: + + ```console + helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \ + --version="${dra_version}" \ + --create-namespace \ + --namespace nvidia-dra-driver-gpu \ + --set resources.gpus.enabled=false \ + --set nvidiaDriverRoot=/run/nvidia/driver + ``` + + Example for host-provided GPU driver: + + ```console + helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \ + --version="${dra_version}" \ + --create-namespace \ + --namespace nvidia-dra-driver-gpu \ + --set resources.gpus.enabled=false + ``` + +## Validate Installation + +1. Confirm that the DRA driver components are running: + + ```console + kubectl get pods -n nvidia-dra-driver-gpu + ``` + + *Example Output* + + ```output + NAME READY STATUS RESTARTS AGE + nvidia-dra-driver-gpu-controller-67cb99d84b-5q7kj 1/1 Running 0 7m26s + nvidia-dra-driver-gpu-kubelet-plugin-h5xsn 1/1 Running 0 7m27s + ``` + +2. 
Verify that GPU DeviceClasses are available: + + ```console + kubectl get deviceclass + ``` + + *Example Output* + + ```output + NAME AGE + compute-domain-daemon.nvidia.com 55s + compute-domain-default-channel.nvidia.com 55s + gpu.nvidia.com 55s + mig.nvidia.com 55s + ``` + +The `compute-domain-daemon.nvidia.com` and `compute-domain-default-channel.nvidia.com` DeviceClasses are installed when ComputeDomain support is enabled. +The `gpu.nvidia.com` and `mig.nvidia.com` DeviceClasses are installed when GPU allocation support is enabled. + +Additional validation steps are available in the DRA Driver repository documentation: + +* [Validate setup for ComputeDomain allocation](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Validate-setup-for-ComputeDomain-allocation) +* [Validate setup for GPU allocation](https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Validate-setup-for-GPU-allocation) diff --git a/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/install-gpu-operator.md b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/install-gpu-operator.md new file mode 100644 index 000000000..31fad58c0 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-nvidia-dra/references/install-gpu-operator.md @@ -0,0 +1,58 @@ + + + +# Install the NVIDIA GPU Operator (for DRA) + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +## GPU Allocation + +1. Create a node selector label on all the nodes in your cluster that support GPU allocation through DRA: + + ```console + kubectl label node $HOSTNAME nvidia.com/dra-kubelet-plugin=true + ``` + +2. Add the Helm repo: + + ```console + helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ + && helm repo update + ``` + +3. 
Install the GPU Operator with the NVIDIA Kubernetes Device Plugin disabled: + + ```console + helm upgrade --install gpu-operator nvidia/gpu-operator \ + --version= \ + --create-namespace \ + --namespace gpu-operator \ + --set devicePlugin.enabled=false \ + --set driver.manager.env[0].name=NODE_LABEL_FOR_GPU_POD_EVICTION \ + --set driver.manager.env[0].value="nvidia.com/dra-kubelet-plugin" + ``` + + Make sure that the value of `driver.manager.env` matches the node selector label that was used when installing the DRA driver helm chart. + +## ComputeDomain + +1. Add the Helm repo: + +```console +helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \ +&& helm repo update +``` + +2. Install the GPU Operator with the device plugin disabled: + +```console +helm upgrade --install gpu-operator nvidia/gpu-operator \ + --version= \ + --create-namespace \ + --namespace gpu-operator +``` + +Refer to the [GPU Operator installation guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-install.html) for additional configuration options when installing the GPU Operator. + +If you are planning to use MIG devices, refer to the [NVIDIA GPU Operator MIG documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html) to configure your cluster for MIG support. diff --git a/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md index 6f7fbeae3..a76dfcb2f 100644 --- a/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md +++ b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/SKILL.md @@ -20,282 +20,45 @@ tags: # Precompiled Driver Containers +Use precompiled NVIDIA driver containers so driver pods do not download kernel +headers / compiler tooling / OS packages and do not spend compute compiling +modules at runtime — valuable for air-gapped or resource-constrained sites. 
This +skill covers checking image availability, enabling/disabling precompiled +support, and building a custom precompiled image when no published variant +matches your kernel. + ## Prerequisites - A running Kubernetes cluster with NVIDIA GPU worker nodes. - The `kubectl` and `helm` CLIs available on a client machine. - A supported operating system for which NVIDIA publishes precompiled driver containers. Refer to the [GPU Operator Component Matrix](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/life-cycle-policy.html#gpu-operator-component-matrix) for supported operating systems. -## About Precompiled Driver Containers - -Containers with precompiled drivers do not require internet access to download Linux kernel -header files, GCC compiler tooling, or operating system packages. - -Using precompiled drivers also avoids the burst of compute demand that is required -to compile the kernel drivers with the conventional driver containers. - -These two benefits are valuable to most sites, but are especially beneficial to sites -with restricted internet access or sites with resource-constrained hardware. - -### Limitations and Restrictions - -* Support for deploying the driver containers with precompiled drivers is limited to - hosts with the x86_64 architecture and operating system versions listed in the supported-precompiled-drivers table. - - For information about using precompiled drivers with OpenShift Container Platform, - refer to [GPU Operator with precompiled drivers on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/gpu-operator-with-precompiled-drivers.html). - -* NVIDIA supports precompiled driver containers for the most recently released long-term - servicing branch (LTSB) driver branch. - -* NVIDIA builds images for the `aws`, `azure`, `generic`, `nvidia`, and `oracle` kernel variants. 
- If your hosts run a different kernel variant, you can build a precompiled driver image - and use your own container registry. - -* Precompiled driver containers do not support NVIDIA vGPU or GPUDirect Storage (GDS). - -## Determining if a Precompiled Driver Container is Available - -The precompiled driver containers are named according to the following pattern: - - -- - -For example, `525-5.15.0-69-generic-ubuntu22.04`. - -Use one of the following ways to check if a driver container is available for your Linux kernel and driver branch: - -* Use a web browser to access the NVIDIA GPU Driver page of the NVIDIA GPU Cloud registry at - https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags. - Use the search field to filter the tags by your operating system version. - -* Use the [NGC CLI](https://ngc.nvidia.com/setup/installers/cli) tool to list the tags for the driver container: - - ```console - $ ngc registry image info nvidia/driver - ``` - - *Example Output* - - ```output - Image Repository Information - Name: driver - Display Name: NVIDIA GPU Driver - Short Description: Provision NVIDIA GPU Driver as a Container. - Built By: NVIDIA - Publisher: NVIDIA - Multinode Support: False - Multi-Arch Support: True - Logo: https://assets.nvidiagrid.net/ngc/logos/Infrastructure.png - Labels: Multi-Arch, NVIDIA AI Enterprise Supported, Infrastructure Software, Kubernetes Infrastructure - Public: Yes - Last Updated: Apr 20, 2023 - Latest Image Size: 688.87 MB - Latest Tag: 525-5.15.0-69-generic-ubuntu22.04 - Tags: - 525-5.15.0-69-generic-ubuntu22.04 - 525-5.15.0-70-generic-ubuntu22.04 - ... - ``` - -## Enabling Precompiled Driver Container Support During Installation - -> [!NOTE] -> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). - -Refer to the common instructions for installing the Operator with Helm at install-gpu-operator. 
-Specify the `--set driver.usePrecompiled=true` and `--set driver.version=` arguments like the following example command: - -```console -$ helm install --wait gpu-operator \ - -n gpu-operator --create-namespace \ - nvidia/gpu-operator \ - --version= \ - --set driver.usePrecompiled=true \ - --set driver.version="" -``` - -Specify a value like `525` for ``. -Refer to Common Chart Customization Options for information about other installation options. - -## Enabling Support After Installation - -Perform the following steps to enable support for precompiled driver containers: - -1. Enable support by modifying the cluster policy: - - ```shell - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ - -p='[ - {"op":"replace", "path":"/spec/driver/usePrecompiled", "value":true}, - {"op":"replace", "path":"/spec/driver/version", "value":""} - ]' - ``` - - Specify a value like `525` for ``. - - *Example Output* - - ```output - clusterpolicy.nvidia.com/cluster-policy patched - ``` - -1. Optional: Confirm that the driver daemon set pods terminate: - - ```console - $ kubectl get pods -n gpu-operator - ``` - - *Example Output* - -1. Confirm that the driver container pods are running: - - ```console - $ kubectl get pods -l app=nvidia-driver-daemonset -n gpu-operator - ``` - - *Example Output* - - Ensure that the pod names include a Linux kernel semantic version number like `5.15.0-69-generic`. - -## Disabling Support for Precompiled Driver Containers - -Perform the following steps to disable support for precompiled driver containers: - -1. Disable support by modifying the cluster policy: - - ```shell - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ - -p='[ - {"op": "replace", "path": "/spec/driver/usePrecompiled", "value":false}, - {"op": "replace", "path": "/spec/driver/version", "value":"550.90.07"}, - ]' - ``` - - *Example Output* - - ```output - clusterpolicy.nvidia.com/cluster-policy patched - ``` - -1. 
Confirm that the conventional driver container pods are running: - - ```console - $ kubectl get pods -l app=nvidia-driver-daemonset -n gpu-operator - ``` - - *Example Output* - - Ensure that the pod names do not include a Linux kernel semantic version number. - -## Building a Custom Driver Container Image - -If a precompiled driver container for your Linux kernel variant is not available, -you can perform the following steps to build and run a container image. - -> [!NOTE] -> NVIDIA provides limited support for custom driver container images. -### Prerequisites -* You have access to a private container registry, such as NVIDIA NGC Private Registry, and can push container images to the registry. -* Your build machine has access to the internet to download operating system packages. -* You know a CUDA version, such as `12.1.0`, that you want to use. - The CUDA version only specifies which base image is used to build the driver container. - The version does not have any correlation to the version of CUDA that is associated with or supported by the resulting driver container. - - One way to find a supported CUDA version for your operating system is to access the NVIDIA GPU Cloud registry - at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags and view the tags. - Use the search field to filter the tags, such as `base-ubuntu22.04`. - The filtered results show the CUDA versions, such as `12.1.0`, `12.0.1`, `12.0.0`, and so on. -* You know the GPU driver branch, such as `525`, that you want to use. - -### Procedure -1. Clone the driver container repository and change directory into the repository: - - ```console - $ git clone https://github.com/NVIDIA/gpu-driver-container.git - ``` - - ```console - $ cd gpu-driver-container - ``` - -1. Change directory to the operating system name and version under the driver directory: - - ```console - $ cd ubuntu22.04/precompiled - ``` - -1. Set environment variables for building the driver container image. 
- - - Specify your private registry URL: - - ```console - $ export PRIVATE_REGISTRY= - ``` - - - Specify the `KERNEL_VERSION` environment variable that matches your kernel variant, such as `5.15.0-1033-aws`: - - ```console - $ export KERNEL_VERSION=5.15.0-1033-aws - ``` - - - Specify the version of the CUDA base image to use when building the driver container: - - ```console - $ export CUDA_VERSION=12.1.0 - ``` - - - Specify the driver branch, such as `525`: - - ```console - $ export DRIVER_BRANCH=525 - ``` - - - Specify the `OS_TAG` environment variable to identify the guest operating system name and version: - - ```console - $ export OS_TAG=ubuntu22.04 - ``` - - The value must match the guest operating system version. - -1. Build the driver container image: - - ```console - $ sudo docker build \ - --build-arg KERNEL_VERSION=$KERNEL_VERSION \ - --build-arg CUDA_VERSION=$CUDA_VERSION \ - --build-arg DRIVER_BRANCH=$DRIVER_BRANCH \ - -t ${PRIVATE_REGISTRY}/driver:${DRIVER_BRANCH}-${KERNEL_VERSION}-${OS_TAG} . - ``` - -1. Push the driver container image to your private registry. - - - Log in to your private registry: - - ```console - $ sudo docker login ${PRIVATE_REGISTRY} --username= - ``` - - Enter your password when prompted. +## Activation - - Push the driver container image to your private registry: +Do this first: identify which phase the user's request maps to in the Phases +table below, then **read the corresponding `references/.md` file before +acting**. All command sequences, image-tag patterns, Helm `--set` flags, +cluster-policy patches, and the custom-image build steps live only in those +reference files — do not improvise commands from this dispatch layer. - ```console - $ sudo docker push ${PRIVATE_REGISTRY}/driver:${DRIVER_BRANCH}-${KERNEL_VERSION}-${OS_TAG} - ``` +## Phases -### Next Steps -* To use the custom driver container image, follow the steps for enabling support during or after installation. 
+| Phase | Summary | Reference | +|-------|---------|-----------| +| Concepts | What precompiled driver containers are, their benefits, and the limitations/restrictions (x86_64-only, LTSB branch, supported kernel variants, no vGPU/GDS). | [references/concepts.md](references/concepts.md) | +| Availability | The `--` naming pattern and how to check (NGC web catalog or `ngc registry image info nvidia/driver`) whether an image exists for your kernel. | [references/availability.md](references/availability.md) | +| Enable / disable | Enable during install (`driver.usePrecompiled=true` + `driver.version`), enable after install (cluster-policy patch), and disable (cluster-policy patch back to a conventional driver version). | [references/enable-disable.md](references/enable-disable.md) | +| Build custom image | When no published variant matches: prerequisites, clone `gpu-driver-container`, set build env vars, `docker build`, push to a private registry, and wire it up via `driver.repository` / `driver.imagePullSecrets`. | [references/build-custom.md](references/build-custom.md) | - If you have not already installed the GPU Operator, in addition to the `--set driver.usePrecompiled=true` - and `--set driver.version=${DRIVER_BRANCH}` arguments for Helm, also specify the `--set driver.repository="$PRIVATE_REGISTRY"` argument. +## Hard rules (apply across all phases) - If the container registry is not public, you need to create an image pull secret in the GPU Operator namespace - and specify it in the `--set driver.imagePullSecrets=` argument. +- Precompiled driver containers are x86_64-only, support the most recent LTSB driver branch, and do not support NVIDIA vGPU or GPUDirect Storage (GDS). +- NVIDIA publishes images only for the `aws`, `azure`, `generic`, `nvidia`, and `oracle` kernel variants; other variants require a custom build. 
+- When precompiled is active, driver pod names include the kernel semantic version (e.g., `5.15.0-69-generic`); when disabled, they do not — use this to verify the mode. +- Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). Never hardcode a version. - If you already installed the GPU Operator, specify the private registry for the driver in the cluster policy: +## Verification - ```console - $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ - -p='[{"op": "replace", "path": "/spec/driver/repository", "value":"$PRIVATE_REGISTRY"}]' - ``` +Confirm the driver daemonset pods are `Running` and that their names do (precompiled) +or do not (conventional) include a Linux kernel semantic version. Exact commands are in +[references/enable-disable.md](references/enable-disable.md). diff --git a/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/availability.md b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/availability.md new file mode 100644 index 000000000..47731ae92 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/availability.md @@ -0,0 +1,45 @@ + + + +# Determining if a Precompiled Driver Container is Available + +The precompiled driver containers are named according to the following pattern: + + -- + +For example, `525-5.15.0-69-generic-ubuntu22.04`. + +Use one of the following ways to check if a driver container is available for your Linux kernel and driver branch: + +* Use a web browser to access the NVIDIA GPU Driver page of the NVIDIA GPU Cloud registry at + https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags. + Use the search field to filter the tags by your operating system version. 
+ +* Use the [NGC CLI](https://ngc.nvidia.com/setup/installers/cli) tool to list the tags for the driver container: + + ```console + $ ngc registry image info nvidia/driver + ``` + + *Example Output* + + ```output + Image Repository Information + Name: driver + Display Name: NVIDIA GPU Driver + Short Description: Provision NVIDIA GPU Driver as a Container. + Built By: NVIDIA + Publisher: NVIDIA + Multinode Support: False + Multi-Arch Support: True + Logo: https://assets.nvidiagrid.net/ngc/logos/Infrastructure.png + Labels: Multi-Arch, NVIDIA AI Enterprise Supported, Infrastructure Software, Kubernetes Infrastructure + Public: Yes + Last Updated: Apr 20, 2023 + Latest Image Size: 688.87 MB + Latest Tag: 525-5.15.0-69-generic-ubuntu22.04 + Tags: + 525-5.15.0-69-generic-ubuntu22.04 + 525-5.15.0-70-generic-ubuntu22.04 + ... + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/build-custom.md b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/build-custom.md new file mode 100644 index 000000000..f0fd6f89b --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/build-custom.md @@ -0,0 +1,119 @@ + + + +# Building a Custom Driver Container Image + +If a precompiled driver container for your Linux kernel variant is not available, +you can perform the following steps to build and run a container image. + +> [!NOTE] +> NVIDIA provides limited support for custom driver container images. + +## Prerequisites + +* You have access to a private container registry, such as NVIDIA NGC Private Registry, and can push container images to the registry. +* Your build machine has access to the internet to download operating system packages. +* You know a CUDA version, such as `12.1.0`, that you want to use. + The CUDA version only specifies which base image is used to build the driver container. 
+ The version does not have any correlation to the version of CUDA that is associated with or supported by the resulting driver container. + + One way to find a supported CUDA version for your operating system is to access the NVIDIA GPU Cloud registry + at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags and view the tags. + Use the search field to filter the tags, such as `base-ubuntu22.04`. + The filtered results show the CUDA versions, such as `12.1.0`, `12.0.1`, `12.0.0`, and so on. +* You know the GPU driver branch, such as `525`, that you want to use. + +## Procedure + +1. Clone the driver container repository and change directory into the repository: + + ```console + $ git clone https://github.com/NVIDIA/gpu-driver-container.git + ``` + + ```console + $ cd gpu-driver-container + ``` + +1. Change directory to the operating system name and version under the driver directory: + + ```console + $ cd ubuntu22.04/precompiled + ``` + +1. Set environment variables for building the driver container image. + + - Specify your private registry URL: + + ```console + $ export PRIVATE_REGISTRY= + ``` + + - Specify the `KERNEL_VERSION` environment variable that matches your kernel variant, such as `5.15.0-1033-aws`: + + ```console + $ export KERNEL_VERSION=5.15.0-1033-aws + ``` + + - Specify the version of the CUDA base image to use when building the driver container: + + ```console + $ export CUDA_VERSION=12.1.0 + ``` + + - Specify the driver branch, such as `525`: + + ```console + $ export DRIVER_BRANCH=525 + ``` + + - Specify the `OS_TAG` environment variable to identify the guest operating system name and version: + + ```console + $ export OS_TAG=ubuntu22.04 + ``` + + The value must match the guest operating system version. + +1. 
Build the driver container image: + + ```console + $ sudo docker build \ + --build-arg KERNEL_VERSION=$KERNEL_VERSION \ + --build-arg CUDA_VERSION=$CUDA_VERSION \ + --build-arg DRIVER_BRANCH=$DRIVER_BRANCH \ + -t ${PRIVATE_REGISTRY}/driver:${DRIVER_BRANCH}-${KERNEL_VERSION}-${OS_TAG} . + ``` + +1. Push the driver container image to your private registry. + + - Log in to your private registry: + + ```console + $ sudo docker login ${PRIVATE_REGISTRY} --username= + ``` + + Enter your password when prompted. + + - Push the driver container image to your private registry: + + ```console + $ sudo docker push ${PRIVATE_REGISTRY}/driver:${DRIVER_BRANCH}-${KERNEL_VERSION}-${OS_TAG} + ``` + +## Next Steps + +* To use the custom driver container image, follow the steps for enabling support during or after installation. + + If you have not already installed the GPU Operator, in addition to the `--set driver.usePrecompiled=true` + and `--set driver.version=${DRIVER_BRANCH}` arguments for Helm, also specify the `--set driver.repository="$PRIVATE_REGISTRY"` argument. + + If the container registry is not public, you need to create an image pull secret in the GPU Operator namespace + and specify it in the `--set driver.imagePullSecrets=` argument. 
+ + If you already installed the GPU Operator, specify the private registry for the driver in the cluster policy: + + ```console + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ + -p='[{"op": "replace", "path": "/spec/driver/repository", "value":"'"$PRIVATE_REGISTRY"'"}]' + ``` diff --git a/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/concepts.md b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/concepts.md new file mode 100644 index 000000000..81ced5ff2 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/concepts.md @@ -0,0 +1,30 @@ + + + +# About Precompiled Driver Containers + +Containers with precompiled drivers do not require internet access to download Linux kernel +header files, GCC compiler tooling, or operating system packages. + +Using precompiled drivers also avoids the burst of compute demand that is required +to compile the kernel drivers with the conventional driver containers. + +These two benefits are valuable to most sites, but are especially beneficial to sites +with restricted internet access or sites with resource-constrained hardware. + +## Limitations and Restrictions + +* Support for deploying the driver containers with precompiled drivers is limited to + hosts with the x86_64 architecture and operating system versions listed in the supported-precompiled-drivers table. + + For information about using precompiled drivers with OpenShift Container Platform, + refer to [GPU Operator with precompiled drivers on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/gpu-operator-with-precompiled-drivers.html). + +* NVIDIA supports precompiled driver containers for the most recently released long-term + servicing branch (LTSB) driver branch. + +* NVIDIA builds images for the `aws`, `azure`, `generic`, `nvidia`, and `oracle` kernel variants.
+ If your hosts run a different kernel variant, you can build a precompiled driver image + and use your own container registry. + +* Precompiled driver containers do not support NVIDIA vGPU or GPUDirect Storage (GDS). diff --git a/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/enable-disable.md b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/enable-disable.md new file mode 100644 index 000000000..8aace08b7 --- /dev/null +++ b/gpu-operator/.agents/skills/gpu-operator-precompiled-drivers/references/enable-disable.md @@ -0,0 +1,94 @@ + + + +# Enabling and Disabling Precompiled Driver Container Support + +## Enabling Support During Installation + +> [!NOTE] +> Replace `` with your target GPU Operator release; see the [releases page](https://github.com/NVIDIA/gpu-operator/releases). + +Refer to the common instructions for installing the Operator with Helm at install-gpu-operator. +Specify the `--set driver.usePrecompiled=true` and `--set driver.version=` arguments like the following example command: + +```console +$ helm install --wait gpu-operator \ + -n gpu-operator --create-namespace \ + nvidia/gpu-operator \ + --version= \ + --set driver.usePrecompiled=true \ + --set driver.version="" +``` + +Specify a value like `525` for ``. +Refer to Common Chart Customization Options for information about other installation options. + +## Enabling Support After Installation + +Perform the following steps to enable support for precompiled driver containers: + +1. Enable support by modifying the cluster policy: + + ```shell + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ + -p='[ + {"op":"replace", "path":"/spec/driver/usePrecompiled", "value":true}, + {"op":"replace", "path":"/spec/driver/version", "value":""} + ]' + ``` + + Specify a value like `525` for ``. + + *Example Output* + + ```output + clusterpolicy.nvidia.com/cluster-policy patched + ``` + +1. 
Optional: Confirm that the driver daemon set pods terminate: + + ```console + $ kubectl get pods -n gpu-operator + ``` + + *Example Output* + +1. Confirm that the driver container pods are running: + + ```console + $ kubectl get pods -l app=nvidia-driver-daemonset -n gpu-operator + ``` + + *Example Output* + + Ensure that the pod names include a Linux kernel semantic version number like `5.15.0-69-generic`. + +## Disabling Support for Precompiled Driver Containers + +Perform the following steps to disable support for precompiled driver containers: + +1. Disable support by modifying the cluster policy: + + ```shell + $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \ + -p='[ + {"op": "replace", "path": "/spec/driver/usePrecompiled", "value":false}, + {"op": "replace", "path": "/spec/driver/version", "value":"550.90.07"} + ]' + ``` + + *Example Output* + + ```output + clusterpolicy.nvidia.com/cluster-policy patched + ``` + +1. Confirm that the conventional driver container pods are running: + + ```console + $ kubectl get pods -l app=nvidia-driver-daemonset -n gpu-operator + ``` + + *Example Output* + + Ensure that the pod names do not include a Linux kernel semantic version number.
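
The pod-name check in the verification steps can be scripted. The following is a minimal sketch, not part of the GPU Operator tooling: the function names `has_kernel_version` and `classify_driver_pods` are hypothetical, and the regular expression assumes that kernel versions follow the `5.15.0-69-generic` shape used in the examples.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the argument embeds a kernel version
# shaped like 5.15.0-69-generic (assumed pattern: X.Y.Z-N-variant).
has_kernel_version() {
  printf '%s\n' "$1" | grep -Eq '[0-9]+\.[0-9]+\.[0-9]+-[0-9]+-[a-z]+'
}

# Hypothetical helper: reads pod names from stdin, one per line, and labels
# each as "precompiled" or "conventional" based on the name alone.
classify_driver_pods() {
  while IFS= read -r pod_name; do
    [ -z "$pod_name" ] && continue
    if has_kernel_version "$pod_name"; then
      echo "precompiled $pod_name"
    else
      echo "conventional $pod_name"
    fi
  done
}
```

Against a live cluster, the helper could be fed with `kubectl get pods -l app=nvidia-driver-daemonset -n gpu-operator -o name | classify_driver_pods`, using the same label and namespace as the commands in the steps. The check only inspects names; it does not confirm that the pods are `Running`.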