Updates for coco 1.1 #408
Open
a-mccarthy wants to merge 4 commits into NVIDIA:main from a-mccarthy:coco-1.1
107 changes: 107 additions & 0 deletions
confidential-containers/samples/kata-nvidia-gpu-values.yaml
    @@ -0,0 +1,107 @@
    +# Example values file to enable NVIDIA GPU shims for the NVIDIA
    +# Confidential Containers Reference Architecture.
    +
    +# Set to true for verbose kata-deploy and Kata runtime logging.
    +debug: false
    +
    +# Disable Node Feature Discovery (NFD) deployment by kata-deploy.
    +# Both kata-deploy and the GPU Operator deploy NFD by default. This
    +# reference architecture relies on the NFD instance that the GPU Operator
    +# deploys and manages, so the kata-deploy NFD is turned off to avoid a
    +# duplicate, conflicting deployment.
    +nfd:
    +  enabled: false
    +
    +# Install the nydus snapshotter on each node alongside containerd.
    +# The confidential -snp and -tdx shims below use nydus to pull container
    +# images directly into the confidential VM (guest pull), which keeps image
    +# contents inside the trusted execution environment (TEE).
    +snapshotter:
    +  setup: ["nydus"]
    +
    +# Disable the chart's default hypervisor/TEE shims and opt in only to
    +# the NVIDIA GPU shims supported by this reference architecture.
    +shims:
    +  disableAll: true
    +
    +  # Non-confidential NVIDIA GPU passthrough shim used when Confidential
    +  # Computing mode is off on the node. The runtime class is restricted to
    +  # nodes where the GPU Operator's Confidential Computing Manager has
    +  # reported nvidia.com/cc.ready.state=false, so it will not schedule on
    +  # CC-ready nodes. The empty containerd snapshotter falls back to the
    +  # default (overlayfs); guest pull is not used for this non-confidential
    +  # path.
    +  qemu-nvidia-gpu:
    +    enabled: true
    +    supportedArches:
    +      - amd64
    +    allowedHypervisorAnnotations: []
    +    containerd:
    +      snapshotter: ""
    +    runtimeClass:
    +      # This label is automatically added by the GPU Operator.
    +      nodeSelector:
    +        nvidia.com/cc.ready.state: "false"
    +
    +  # Confidential NVIDIA GPU passthrough for AMD SEV-SNP nodes.
    +  # Scheduled where the GPU Operator reports CC mode is on AND NFD
    +  # reports SEV-SNP support. Set agent.httpsProxy / agent.noProxy if
    +  # the guest needs a proxy to reach the registry.
    +  qemu-nvidia-gpu-snp:
    +    enabled: true
    +    supportedArches:
    +      - amd64
    +    allowedHypervisorAnnotations: []
    +    containerd:
    +      snapshotter: "nydus"
    +      forceGuestPull: false
    +    crio:
    +      guestPull: true
    +    agent:
    +      httpsProxy: ""
    +      noProxy: ""
    +    runtimeClass:
    +      # These labels are automatically added by the GPU Operator and NFD
    +      # respectively.
    +      nodeSelector:
    +        nvidia.com/cc.ready.state: "true"
    +        amd.feature.node.kubernetes.io/snp: "true"
    +
    +  # Confidential NVIDIA GPU passthrough for Intel TDX nodes.
    +  # Same selectors and snapshotter behavior as the SNP shim above,
    +  # but pinned to TDX-capable hosts.
    +  qemu-nvidia-gpu-tdx:
    +    enabled: true
    +    supportedArches:
    +      - amd64
    +    allowedHypervisorAnnotations: []
    +    containerd:
    +      snapshotter: "nydus"
    +      forceGuestPull: false
    +    crio:
    +      guestPull: true
    +    agent:
    +      httpsProxy: ""
    +      noProxy: ""
    +    runtimeClass:
    +      # These labels are automatically added by the GPU Operator and NFD
    +      # respectively.
    +      nodeSelector:
    +        nvidia.com/cc.ready.state: "true"
    +        intel.feature.node.kubernetes.io/tdx: "true"
    +
    +# Default shim when a pod does not request a runtime class. Set to the
    +# non-confidential shim so pods only run in a confidential VM when
    +# they explicitly request the -snp or -tdx runtime class.
    +defaultShim:
    +  amd64: qemu-nvidia-gpu # Can be changed to qemu-nvidia-gpu-snp or qemu-nvidia-gpu-tdx if preferred
    +
    +# Create one Kubernetes RuntimeClass per enabled shim above
    +# (kata-qemu-nvidia-gpu, kata-qemu-nvidia-gpu-snp, kata-qemu-nvidia-gpu-tdx).
    +# createDefault: false suppresses the generic "kata" RuntimeClass since
    +# you should always reference a specific NVIDIA shim
    +# by name in pod specs.
    +runtimeClasses:
    +  enabled: true
    +  createDefault: false
    +  defaultName: "kata"
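
For illustration, a pod could opt in to one of the RuntimeClasses named in the values file comments roughly as follows. This is a minimal sketch and not part of the PR: the pod name, container name, and image are placeholders, and the GPU resource request is left out because the resource name depends on what the cluster advertises. Since the file describes the node selectors as belonging to the runtime class, the pod itself should not need one.

```yaml
# Hypothetical pod that explicitly requests the SEV-SNP confidential shim.
apiVersion: v1
kind: Pod
metadata:
  name: cc-gpu-example                # placeholder name
spec:
  # RuntimeClass name taken from the values file comments; use
  # kata-qemu-nvidia-gpu-tdx on Intel TDX nodes instead.
  runtimeClassName: kata-qemu-nvidia-gpu-snp
  restartPolicy: Never
  containers:
    - name: workload                  # placeholder name
      image: registry.example.com/gpu-workload:latest   # placeholder image
      # Add the GPU resource limit your cluster exposes here; it is
      # omitted because this sketch does not assume a resource name.
```

With `defaultShim` set to the non-confidential shim, naming the `-snp` or `-tdx` class explicitly like this is what places a pod in a confidential VM.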
    @@ -52,6 +52,9 @@ NVIDIA GPUs
     * - NVIDIA B200
       - Single-GPU, Multi-GPU

    +* - NVIDIA HGX B300
    +  - Single-GPU, Multi-GPU
    +
     * - NVIDIA RTX Pro 6000 BSE
       - Single-GPU

    @@ -75,10 +78,10 @@ CPU Platforms
       - Operating System
       - Kernel Version
     * - AMD Genoa / Milan
    -  - Ubuntu 25.10
    +  - Ubuntu 25.10 or 26.04
       - 6.17+
     * - Intel Emerald Rapids (ER) / Granite Rapids (GR)
    -  - Ubuntu 25.10
    +  - Ubuntu 25.10 or 26.04
       - 6.17+

     For additional information on node configuration, refer to the `Confidential Computing Deployment Guide <https://docs.nvidia.com/cc-deployment-guide-tdx-snp.pdf>`_ for information about supported NVIDIA GPUs, such as the NVIDIA Hopper H100.
    @@ -88,7 +91,7 @@ The following topics in the deployment guide apply to a cloud-native environment
     * Hardware selection and initial hardware configuration, such as BIOS settings.
     * Host operating system selection, initial configuration, and validation.

    -When following the cloud-native sections in the deployment guide linked above, use Ubuntu 25.10 as the host OS with its default kernel version and configuration.
    +When following the cloud-native sections in the deployment guide linked above, use Ubuntu 25.10 or 26.04 as the host OS with its default kernel version and configuration.

     For additional resources on machine setup:

    @@ -114,15 +117,15 @@ Supported Software Components
     * - `QEMU <https://www.qemu.org/>`__
       - 10.1 \+ Patches
     * - `Containerd <https://github.com/containerd/containerd>`__
    -  - 2.2.2
    +  - 2.2.2 or 2.3.x

Review comment on the line above: Should we do a 2.2.2+ or 2.3.x?

     * - `Kubernetes <https://kubernetes.io/>`__
       - 1.32 \+
     * - `NVIDIA GPU Operator <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html>`__ and its components.

         Refer to the :ref:`GPU Operator Component Matrix <gpuop:operator-component-matrix>` for the list of components and versions included in each release.
    -  - v26.3.1 and higher
    +  - ${gpu_operator_version} and higher
     * - `Kata Containers <https://katacontainers.io/>`__
    -  - 3.29 (installed with ``kata-deploy`` Helm chart)
    +  - ${kata_version} (installed with ``kata-deploy`` Helm chart)
     * - `Key Broker Service (KBS) protocol <https://confidentialcontainers.org/docs/attestation/>`__
       - 0.4.0
     * - `Kata Lifecycle Manager <https://github.com/kata-containers/lifecycle-manager>`__
Just for me to understand: why are we removing this here? Maybe I don't have the complete overview right now, and this is listed somewhere else?

We list it on the platform support page. We added it because there was a hard requirement that 2.2.2 be installed. In this release we now support 2.3.
Do you still think it is valuable to call this out to folks here as well as in the support matrix?

Might be a bit confusing. Yes, we list containerd in the support matrix, but we also list QEMU. While QEMU comes from kata-deploy, containerd doesn't, so users still need to ensure they have containerd running. Maybe we can collapse this into the prior point about "A Kubernetes cluster with cluster administrator privileges", changing it to something like:
"A Kubernetes cluster with cluster administrator privileges using containerd on the nodes", "Refer to the ... table for supported Kubernetes and containerd versions"?
@fidencio thoughts?

I think we should keep the note about containerd, as we know from past experience that people tend to ignore it when it's not that explicit.