Init pass at restructuring CoCo TOC #385
8a83b89 to 8fac0fb
| resources: | ||
| limits: | ||
| nvidia.com/GH100_H200_141GB: "1" |
Confirm this is a valid GPU name on a node.
| If you need a timeout of more than 1200 seconds, you will also need to adjust Kata Agent Policy's ``image_pull_timeout`` value which controls the agent-side timeout for guest-image pull. | ||
| To do this, add the ``agent.image_pull_timeout`` kernel parameter to your shim configuration, or pass an explicit value in a pod annotation in the ``io.katacontainers.config.hypervisor.kernel_params: "..."`` annotation. | ||
|
| "nvidia.com/GH100_H200_141GB": "1" |
Confirm this output.
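For illustration, the pod-annotation route described in the quoted text might look like the sketch below. The annotation key and the ``agent.image_pull_timeout`` parameter name come from the diff; the pod name, image, runtime class name, and the 1800-second value are assumptions chosen for the example, not values from this PR.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-sample                     # hypothetical pod name
  annotations:
    # Passes the agent-side image pull timeout as a guest kernel parameter.
    # 1800 is an assumed example value for a timeout above 1200 seconds.
    io.katacontainers.config.hypervisor.kernel_params: "agent.image_pull_timeout=1800"
spec:
  runtimeClassName: kata-qemu-nvidia-gpu-snp   # assumed; use the runtime class installed on your cluster
  containers:
    - name: cuda-sample
      image: registry.example.com/cuda-sample:latest   # placeholder image
```

The annotation value is a single quoted string, so any additional kernel parameters would go inside the same string, separated by spaces.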
| ***************************************************** | ||
| ##################################################### | ||
| NVIDIA Confidential Containers Reference Architecture |
I wonder whether we can make "Supported Features and Deployment Scenarios" and "Limitations and Restrictions" a bit more prominent. They get a bit buried in the already lengthy overview page. Maybe we can move these two to a different main page, or even create a separate main page?
There was a problem hiding this comment.
I'd be in favor of doing this, but not in this PR. I have to circle back with Hema, because I think we can flesh out a lot of these sections more, and it may be a good idea to create separate pages.
manuelh-dev left a comment
LGTM, left just a few comments. Feel free to resolve them if they don't seem immediately helpful.
685c303 to a4aa920
@manuelh-dev I made some more updates to this PR. Do you have time this week to review? Updates including
mikemckiernan left a comment
Def a good idea to provide a streamlined, common install page. LMK what gibberish I can clarify.
| Deploy Confidential Containers | ||
| ****************************** | ||
| ######################################### | ||
| Install Guide for Confidential Containers |
Def better than what I had and requires differentiation from the quickstart approach. I don't think the title is wrong, but I'm wondering if it can be more of a contrast to quickstart.
- Detailed Installation
- Common Installation Options (might be untrue)
- Traditional Workload Considerations
There was a problem hiding this comment.
Updated to Detailed Install Guide
| Refer to the :doc:`NVIDIA GPU Operator <gpuop:overview>` and `Kata Containers <https://katacontainers.io/docs/>`_ documentation for more information on these software components. | ||
| Refer to the `Kubernetes documentation <https://kubernetes.io/docs/home/>`_ for more information on Kubernetes cluster administration. | ||
| #. :doc:`Prerequisites <prerequisites>`. | ||
| #. :ref:`Label nodes for Confidential Containers components <coco-label-nodes>` |
not-sure: I wonder if "Label nodes to install Confidential Containers components" could set expectations for why we're labelling nodes. Or, "Label the nodes to configure for Confidential Containers"?
There was a problem hiding this comment.
Thanks for the suggestions on this section! I updated the wording here to hopefully be less clunky.
| You can set the default confidential computing mode of the NVIDIA GPUs by setting the ``ccManager.defaultMode=<on|off>`` option. | ||
| The default value of ``ccManager.defaultMode`` is ``on``. | ||
| You can set this option when you install NVIDIA GPU Operator or afterward by modifying the cluster-policy instance of the ClusterPolicy object. | ||
|
| Set a node-level mode by applying the ``nvidia.com/cc.mode=<on|off|ppcie>`` label on the node. | ||
| If you set a specific mode on a node, it has higher precedence than the cluster-wide default mode. | ||
|
| When you change the mode, the manager performs the following actions: | ||
|
| * Evicts the other GPU Operator operands from the node. | ||
| However, the manager does not drain user workloads. You must make sure that no user workloads are running on the node before you change the mode. | ||
| * Changes the mode and resets the GPU. | ||
| * Reschedules the other GPU Operator operands. |
I wonder if this info could follow the table or if it can be removed if it is redundant with the info in the sections that follow. You likely inherited some verbosity from my content.
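For reference, the cluster-wide default quoted in the diff could be set at install time with a Helm values fragment like the sketch below. This assumes the dotted ``ccManager.defaultMode`` option maps to nested keys in the chart's values file in the usual Helm way; the file name is hypothetical.

```yaml
# values-coco.yaml (hypothetical file name)
ccManager:
  # Cluster-wide default confidential computing mode: "on" or "off".
  # Quoted because some YAML parsers read a bare on/off as a boolean.
  defaultMode: "on"
```

Per the quoted text, a ``nvidia.com/cc.mode`` label on a node takes precedence over this default for that node.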
| Complete the **Install** section (through :doc:`Run a Sample Workload <run-sample-workload>` with ``Test PASSED``) before wiring attestation into production workloads. | ||
|
| Attestation is not required for the install sample workload. | ||
| Configure attestation when workloads need secrets, encrypted container images, or authenticated registries. |
@fitzthum do we care about authenticated registries?
Should we formulate this more broadly? Every deployment should need attestation. Is there value in the solution when not conducting attestation?
| Attestation | ||
| *********** | ||
|
| As a :ref:`Security Engineer <coco-persona-security-engineer>`, use this page to configure and verify attestation for confidential workloads. |
I think we should change the scope here and clearly delimit what this page does and does not cover. We should emphasize that attestation is required but that this is out of scope for this page, and instead state that this page explains how to get to a basic setup of Trustee and kbs-client for evaluation purposes. The workload etc. needs to be configured for attestation, so our goal is not to provide an end-to-end sample.
There was a problem hiding this comment.
This has been updated. I also added a "Using this documentation" section to the index page that calls out right from the start that we only deal with NVIDIA-specific info.
Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
aadf85d to 2371029
The deployment guide has grown quite long. This is a draft attempt at splitting up the content into a more usable form.