diff --git a/confidential-containers/confidential-containers-deploy.rst b/confidential-containers/confidential-containers-deploy.rst index 872013e1c..5ad4ba6c6 100644 --- a/confidential-containers/confidential-containers-deploy.rst +++ b/confidential-containers/confidential-containers-deploy.rst @@ -175,7 +175,7 @@ Kubernetes Cluster && ./get_helm.sh -* Enable the ``KubeletPodResourcesGet`` and ``RuntimeClassInImageCriApi`` Kubelet feature gates on your cluster. +* Add the ``KubeletPodResourcesGet`` and ``RuntimeClassInImageCriApi`` feature gates to the Kubelet configuration file on each worker node (typically ``/var/lib/kubelet/config.yaml``). On Kubernetes v1.34 and later, ``KubeletPodResourcesGet`` is already enabled by default and only ``RuntimeClassInImageCriApi`` requires explicit configuration. On earlier Kubernetes versions, enable both gates. @@ -184,7 +184,14 @@ Kubernetes Cluster * ``RuntimeClassInImageCriApi``: Alpha since Kubernetes v1.29 and not enabled by default. Required to support pod deployments that use multiple snapshotters side-by-side. - Add both feature gates to your Kubelet configuration (typically ``/var/lib/kubelet/config.yaml``): + Add the feature gates to your Kubelet configuration. + For example, on the worker node: + + .. code-block:: console + + $ sudo nano /var/lib/kubelet/config.yaml + + Add the following to the file: .. code-block:: yaml @@ -211,7 +218,14 @@ Kubernetes Cluster Set ``runtimeRequestTimeout`` in your `kubelet configuration `_ to ``20m`` to align with the default Kata shim ``image_pull_timeout`` of 1200 seconds. The kubelet default is 2 minutes, which can be too short for GPU workloads. - Add or update the ``runtimeRequestTimeout`` field in your kubelet configuration (typically ``/var/lib/kubelet/config.yaml``): + Add or update the ``runtimeRequestTimeout`` field in the same Kubelet configuration file (``/var/lib/kubelet/config.yaml``). + For example: + + .. 
code-block:: console + + $ sudo nano /var/lib/kubelet/config.yaml + + Add or update the following in the file: .. code-block:: yaml :emphasize-lines: 3 @@ -235,6 +249,10 @@ Kubernetes Cluster Installation ============ +This section installs Kata Containers and the NVIDIA GPU Operator on your existing Kubernetes cluster. +The worker nodes listed by ``kubectl get nodes`` are the nodes already registered with your cluster when it was provisioned. +This guide does not create new nodes. + .. _coco-label-nodes: Label Nodes @@ -270,16 +288,10 @@ Label Nodes $ kubectl label node $NODE_NAME nvidia.com/gpu.workload.config=vm-passthrough - The GPU Operator uses this label to determine what software components to deploy to a node. - The ``nvidia.com/gpu.workload.config=vm-passthrough`` label specifies that the node should receive the software components to run Confidential Containers. - - A node can only run one container runtime at a time, so a labeled node runs only Confidential Container workloads and cannot run traditional GPU container workloads. - The labeling approach is useful if you want to run Confidential Containers workloads on some nodes and traditional GPU container workloads on other nodes in your cluster. - For more details on how the GPU Operator deploys components to your cluster, refer to the :ref:`GPU Operator Cluster Topology Considerations ` section in the architecture overview. - - .. tip:: + .. note:: - Skip this section if you plan to use all nodes in your cluster to run Confidential Containers and instead set ``sandboxWorkloads.defaultWorkload=vm-passthrough`` when installing the GPU Operator. + If the label is already present, ``kubectl label`` may print ``not labeled`` even though the label is set correctly. + Use the verification step below to confirm the label value. #. 
Verify the node label was added: @@ -295,6 +307,18 @@ Label Nodes After labeling the node, you can continue to the next steps to install Kata Containers and the NVIDIA GPU Operator. +The GPU Operator uses this label to determine what software components to deploy to a node. +The ``nvidia.com/gpu.workload.config=vm-passthrough`` label specifies that the node should receive the software components to run Confidential Containers. + +A node can only run one container runtime at a time, so a labeled node runs only Confidential Container workloads and cannot run traditional GPU container workloads. +The labeling approach is useful if you want to run Confidential Containers workloads on some nodes and traditional GPU container workloads on other nodes in your cluster. +For more details on how the GPU Operator deploys components to your cluster, refer to the :ref:`GPU Operator Cluster Topology Considerations ` section in the architecture overview. + +.. tip:: + + Skip this section if you plan to use all nodes in your cluster to run Confidential Containers and instead set ``sandboxWorkloads.defaultWorkload=vm-passthrough`` when installing the GPU Operator. + + .. _coco-install-kata-chart: Install the Kata Containers Helm Chart @@ -349,20 +373,31 @@ The minimum required version is 3.29.0. The GPU Operator will deploy and manage NFD in the next step. -#. Optional: Verify that the ``kata-deploy`` pod is running: +#. Verify that the ``kata-deploy`` pod is running: .. code-block:: console $ kubectl get pods -n kata-system | grep kata-deploy - *Example Output:* + *Example Output (install in progress):* + + .. code-block:: output + + NAME READY STATUS RESTARTS AGE + kata-deploy-b2lzs 0/1 Running 0 2m15s + + *Example Output (install complete):* .. code-block:: output NAME READY STATUS RESTARTS AGE kata-deploy-b2lzs 1/1 Running 0 6m37s -#. 
Optional: Verify that the ``kata-qemu-nvidia-gpu``, ``kata-qemu-nvidia-gpu-snp``, and ``kata-qemu-nvidia-gpu-tdx`` runtime classes are available: + The ``kata-deploy`` chart deploys a DaemonSet that installs Kata on each node and then keeps running. + Wait until the pod is ``Running`` and ``READY`` is ``1/1`` to confirm the Kata install finished. + +#. Verify that the ``kata-qemu-nvidia-gpu``, ``kata-qemu-nvidia-gpu-snp``, and ``kata-qemu-nvidia-gpu-tdx`` runtime classes are available. + These runtime classes are required to schedule confidential container workloads. .. code-block:: console @@ -377,11 +412,17 @@ The minimum required version is 3.29.0. kata-qemu-nvidia-gpu-snp kata-qemu-nvidia-gpu-snp 40s kata-qemu-nvidia-gpu-tdx kata-qemu-nvidia-gpu-tdx 40s + If only some runtime classes are listed, the chart is still deploying. + Wait 2–5 minutes and run the command again. + Missing runtime classes after 5 minutes indicate a Kata installation problem. + Check the ``kata-deploy`` pod logs for more details. + Before :ref:`running a sample workload `, all three runtime classes must be present. + Several runtimes are installed by the ``kata-deploy`` chart. The ``kata-qemu-nvidia-gpu`` runtime class is used with Kata Containers, in a non-Confidential Containers scenario. The ``kata-qemu-nvidia-gpu-snp`` for AMD-based systems or ``kata-qemu-nvidia-gpu-tdx`` for Intel-based systems runtime classes are used to deploy Confidential Containers workloads. -#. Optional: If you have an issue deploying the ``kata-deploy`` pod or are not seeing the expected runtime classes, get the pod name and view the logs: +#. Optional: If the ``kata-deploy`` pod is not ``Running`` and ready, or runtime classes are missing, get the pod name and view the logs: .. code-block:: console @@ -390,6 +431,18 @@ The minimum required version is 3.29.0. Replace ```` with the name of the ``kata-deploy`` pod from the first command's output. + *Example Output (successful install):* + + .. 
code-block:: output + + ... + Install completed + daemonset mode: waiting for SIGTERM + + If logs indicate a problem, recheck the :ref:`Prerequisites `. + Search the `Kata Containers issue `_ for similar reports. + If the issue persists, file a new issue there with ``kata-deploy`` pod logs and your environment details. + .. _coco-install-gpu-operator: Install the NVIDIA GPU Operator @@ -452,12 +505,18 @@ Install the NVIDIA GPU Operator and configure it to deploy Confidential Containe Refer to the :ref:`Common chart customization options ` in :doc:`Installing the NVIDIA GPU Operator ` for more details on the additional general configuration options you can specify when installing the GPU Operator. -#. Optional: Verify that all GPU Operator pods, especially the Confidential Computing Manager, Kata Device Plugin and VFIO Manager operands, are running: +#. Verify that all GPU Operator pods, especially the Confidential Computing Manager, Kata Sandbox Device Plugin, and VFIO Manager operands, are running: .. code-block:: console $ kubectl get pods -n gpu-operator + .. note:: + + The first time you run this command, you might see only a subset of pods while operands are still starting. + It can take 3–5 minutes after the Helm command completes for all GPU Operator pods to reach the ``Running`` state. + Re-run the command until the Confidential Computing Manager, Kata Sandbox Device Plugin, and VFIO Manager pods are ``Running``. + *Example Output:* .. code-block:: output @@ -483,6 +542,27 @@ Install the NVIDIA GPU Operator and configure it to deploy Confidential Containe $ kubectl logs -n gpu-operator Replace ```` with the name of the GPU Operator pod from ``kubectl get pods -n gpu-operator``. + Also refer to the :doc:`NVIDIA GPU Operator troubleshooting guide ` for more details on resolving common issues. + +#. Verify that the node advertises GPU passthrough capacity: + + .. code-block:: console + + $ kubectl describe node $NODE_NAME | grep -E "nvidia.com/pgpu" + + *Example Output:* + + .. 
code-block:: output + + nvidia.com/pgpu: 1 + + If capacity is ``0``, a startup race may have occurred between the VFIO Manager and Kata Sandbox Device Plugin. + Restart the device plugin and wait for the rollout to finish: + + .. code-block:: console + + $ kubectl rollout restart daemonset/nvidia-kata-sandbox-device-plugin-daemonset -n gpu-operator + $ kubectl rollout status daemonset/nvidia-kata-sandbox-device-plugin-daemonset -n gpu-operator #. Optional: If you have host access to the worker node, you can perform the following validation step: @@ -501,10 +581,6 @@ Install the NVIDIA GPU Operator and configure it to deploy Confidential Containe Kernel driver in use: vfio-pci Kernel modules: nvidiafb, nouveau - .. tip:: - - If you have an issue deploying the GPU Operator, refer to the :doc:`NVIDIA GPU Operator troubleshooting guide ` for guidance on troubleshooting and resolving issues. - With Kata Containers and the GPU Operator installed, you can start using your cluster to run Confidential Containers workloads. To run a sample workload, refer to the :ref:`Run a Sample Workload ` section. @@ -519,6 +595,8 @@ For further configuration settings, refer to the following sections: Run a Sample Workload ===================== +Before running a workload, confirm that Kata Containers and the GPU Operator are both installed and that the ``kata-qemu-nvidia-gpu-snp`` and ``kata-qemu-nvidia-gpu-tdx`` runtime classes are available on the cluster. + A pod manifest for a confidential container GPU workload requires that you specify the ``kata-qemu-nvidia-gpu-snp`` runtime class for AMD-based systems or ``kata-qemu-nvidia-gpu-tdx`` for Intel-based systems. 1. Create a file, such as the following ``cuda-vectoradd-kata.yaml`` sample, specifying the appropriate runtime class for your system: @@ -618,6 +696,26 @@ A pod manifest for a confidential container GPU workload requires that you speci NAME READY STATUS RESTARTS AGE cuda-vectoradd-kata 1/1 Running 0 10s + .. 
note:: + + ``kubectl apply`` returns immediately and produces no further output while the pod is starting. + If the pod does not reach ``Running``, describe it to check events: + + .. code-block:: console + + $ kubectl describe pod cuda-vectoradd-kata + + If you see ``FailedCreatePodSandBox`` with ``GetPodResources failed`` or ``PodResources API Get method disabled``, the ``KubeletPodResourcesGet`` feature gate is not enabled on the worker node. + The event looks similar to the following: + + .. code-block:: output + + Warning FailedCreatePodSandBox ... kubelet Failed to create pod sandbox: ... device cold plug failed: ... GetPodResources failed for pod(cuda-vectoradd-kata) in namespace(default): ... PodResources API Get method disabled + + Refer to the Kubernetes Prerequisites section for details on how to enable the feature gate. + + If the pod stays in ``Pending`` with ``Insufficient nvidia.com/pgpu``, confirm the node is labeled with ``nvidia.com/gpu.workload.config=vm-passthrough`` and that ``nvidia.com/pgpu`` capacity is greater than zero. + 4. View the logs from the pod after the container starts: .. code-block:: console @@ -749,7 +847,6 @@ You can set this option when you install NVIDIA GPU Operator or afterward by mod When you change the mode, the manager performs the following actions: * Evicts the other GPU Operator operands from the node. - However, the manager does not drain user workloads. You must make sure that no user workloads are running on the node before you change the mode. * Changes the mode and resets the GPU. @@ -835,6 +932,14 @@ To verify that a mode change was successful, view the ``nvidia.com/cc.mode``, $ kubectl get node $NODE_NAME -o json | \ jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com/cc")))' +To disable Confidential Computing on a node and verify the transition: + +.. code-block:: console + + $ kubectl label node $NODE_NAME nvidia.com/cc.mode=off --overwrite + +Wait 1-2 minutes and recheck the labels. 
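Instead of waiting a fixed interval, the label check can be polled. The following is a minimal sketch, not part of the documented procedure: the function names ``cc_labels_converged`` and ``wait_for_cc_mode`` and the default timeout and interval are assumptions for illustration. It relies only on ``kubectl get node -o json`` and ``jq``, both of which appear in the verification command above, and treats the transition as complete when ``nvidia.com/cc.mode`` and ``nvidia.com/cc.mode.state`` hold the same value.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and defaults are hypothetical, not part of the GPU Operator.

# Return 0 when both label values are non-empty and equal.
cc_labels_converged() {
  local mode="$1" state="$2"
  [ -n "$mode" ] && [ "$mode" = "$state" ]
}

# Poll a node until nvidia.com/cc.mode and nvidia.com/cc.mode.state match.
# Usage: wait_for_cc_mode NODE [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_cc_mode() {
  local node="$1" timeout="${2:-180}" interval="${3:-10}" elapsed=0
  local json="" mode="" state=""
  while [ "$elapsed" -lt "$timeout" ]; do
    # Fetch the node once per iteration and read both labels from the same snapshot.
    json=$(kubectl get node "$node" -o json)
    mode=$(printf '%s' "$json" | jq -r '.metadata.labels["nvidia.com/cc.mode"] // ""')
    state=$(printf '%s' "$json" | jq -r '.metadata.labels["nvidia.com/cc.mode.state"] // ""')
    if cc_labels_converged "$mode" "$state"; then
      echo "converged: cc.mode=$mode"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out: cc.mode=$mode cc.mode.state=$state" >&2
  return 1
}
```

For example, after sourcing the functions, ``wait_for_cc_mode $NODE_NAME 180 10`` checks every 10 seconds for up to 3 minutes. After the helper returns, rerun the label query above to view the labels.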
+ *Example Output (CC mode disabled):* .. code-block:: json @@ -864,11 +969,15 @@ To verify that a mode change was successful, view the ``nvidia.com/cc.mode``, * The ``nvidia.com/cc.ready.state`` label indicates whether the node is ready to run Confidential Container workloads. It is set to ``true`` when ``cc.mode.state`` is ``on`` or ``ppcie``, and ``false`` when ``cc.mode.state`` is ``off``. + When you disable CC mode, expect ``nvidia.com/cc.ready.state`` to become ``false`` once the transition completes. + While a transition is in progress, ``nvidia.com/cc.mode.state`` may temporarily differ from ``nvidia.com/cc.mode``. + .. note:: It can take one to two minutes for GPU state transitions to complete and the labels to be updated. A mode change is complete and successful when ``nvidia.com/cc.mode`` and ``nvidia.com/cc.mode.state`` have the same value. + If the labels do not converge, check the Confidential Computing Manager pod logs, confirm that no user workloads are running on the node, and reapply the ``nvidia.com/cc.mode`` label. .. _coco-configuration-multi-gpu-passthrough: