docs(skills): initial conversion of GPU Operators skills #401

Open
miyoungc wants to merge 1 commit into main from gpu-operator-docs-to-skills

@miyoungc

Collaborator

No description provided.

@chenopis chenopis left a comment

Collaborator

Documentation Review — 15 findings (6 critical)

This PR is a first-pass conversion of GPU Operator RST docs into Agent Skills (SKILL.md + reference .md) plus Sphinx .. meta:: blocks for skill-discoverable metadata on the existing RSTs. The Skill-format direction is great, but the conversion script that produced the new files has multiple systemic bugs that make the resulting content broken or misleading.

The findings below are organized by root cause so you can fix them in the converter once and re-run, rather than file-by-file. Where a class of issue recurs many times, I've posted one inline finding on a representative location and listed the full set in that comment.

Critical (broken content / wrong information)

  1. gpu-operator-install-ing-nvidia/SKILL.md: Lost prerequisites (only 1 of 5+ items survived the conversion).
  2. gpu-operator-install-ing-nvidia/SKILL.md: .. literalinclude:: content silently dropped ("Create a file with contents like the following:" then no example). Repeats in at least 11 places across 6 SKILLs.
  3. gpu-operator-nvidia-google/SKILL.md: YAML fragment leaking out of code block at file top.
  4. gpu-operator-nvidia-google/SKILL.md: Mangled table (4 columns merged into 1 header row).
  5. gpu-operator-references/SKILL.md: Description routes wrong for the umbrella references skill (claims confidential-containers only; loads 7 references).
  6. gpu-operator-install-ing-nvidia/SKILL.md: Skill name gpu-operator-install-ing-nvidia is a converter typo for "installing"; this is a top-level routing identity.

High (broken cross-refs and missing assets)

  7. Broken :external+...:doc: Sphinx role leaking as raw text. 9 instances across 7 SKILLs.
  8. pstai_ RST hyperlink target leaking in references/overview.md license table. 16 occurrences.
  9. references/overview.md: Missing image asset (graphics/nvidia-gpu-operator-image.jpg).
  10. Lost :ref:/:doc: cross-references as bare text. Pattern recurs 40+ times across SKILLs.

Medium (style/structure)

  11. references/security.md:26 - duplicated phrase.
  12. references/overview.md:43 - typo identifieis.
  13. Trigger keywords - … suffix in description frontmatter is not part of the Agent Skills spec. 25 SKILLs affected.
  14. Step N: prefix on every H2 regardless of procedural intent. 23 SKILLs affected.
  15. .. note::/.. tip:: admonitions flattened to bare **Tip:** bold. All SKILLs affected.

Pre-existing (not regressions)

The deterministic style scanner reports 42 findings (latinisms e.g., via; banned marketing words simple, simply; contractions It's, let's). These were in the source RSTs — flagging here only because once the docs are converted to Skills they'll start being executed by agents, where prose-quality regressions matter more than they did in the upstream rendered HTML.

Critical issues must be resolved before merge.

Review generated with AI assistance using DORI ::pr.

@@ -0,0 +1,493 @@
---
name: "gpu-operator-install-ing-nvidia"

@chenopis chenopis May 20, 2026

Collaborator

Critical: skill name is a converter typo (install-ing) — root_cause: weird-skill-naming

The converter appears to have split "installing" on a hyphen boundary, producing the broken identifier gpu-operator-install-ing-nvidia. This is the skill's top-level routing name, the public identity that agents use to address the skill. (Note: no sibling SKILL.md currently references this name, so the blast radius is the broken identifier itself, not cross-references.)

Suggested rename: gpu-operator-getting-started (matches the source page getting-started.rst and the :description-agent: text in the corresponding .. meta:: block) or gpu-operator-install (shorter, matches install-gpu-operator-* sibling pages).

Whichever you pick, the directory name (gpu-operator-install-ing-nvidia/), the name: frontmatter field, and any (use the gpu-operator-install-ing-nvidia skill) references in other SKILL.md files all need to update together.


# Prerequisites

1. You have the `kubectl` and `helm` CLIs available on a client machine.

Collaborator

Critical: prerequisites silently truncated — root_cause: lost-content

This SKILL.md was converted from gpu-operator/getting-started.rst, which lists ~5 prerequisite items:

  1. kubectl and helm CLIs (kept here)
  2. ClusterPolicy / driver / OS-version constraints (lost)
  3. Container engine (CRI-O or containerd) on every node (lost)
  4. PSA pod-security.kubernetes.io/enforce=privileged labeling (lost)
  5. NFD already running and how to detect it (lost)

The converter dropped items 2–5 silently. Following this skill as written, an agent (or human) would attempt the install on a cluster missing any of those preconditions and the install would fail in confusing ways.

The original prereqs are at gpu-operator/getting-started.rst:50-90 in the source RST. Restore them in this SKILL.md.


You can perform the following steps to deploy Jupyter Notebook in your cluster:

1. Create a file, such as `tf-notebook.yaml`, with contents like the following example:

Collaborator

Critical: .. literalinclude:: content silently dropped — root_cause: dropped-literalinclude

The conversion script lost the actual example content where the source uses .. literalinclude::. Here at line 426–428: "Create a file, such as tf-notebook.yaml, with contents like the following example:" jumps straight to step 2 ("Apply the manifest") with no manifest in between. The source RST (getting-started.rst:709) had .. literalinclude:: ./manifests/input/tf-notebook.yaml. The Jupyter Notebook tutorial in this skill is now unrunnable.

This pattern recurs in at least 11 places across 6 SKILLs:

| File | Line | Asset |
| --- | --- | --- |
| gpu-operator-install-ing-nvidia/SKILL.md | 426 | tf-notebook.yaml |
| gpu-operator-multiinstance/SKILL.md | 357 | custom-mig-config.yaml |
| gpu-operator-nvidia-amazon/SKILL.md | 116 | cluster-config.yaml |
| gpu-operator-nvidia-driver/SKILL.md | 179 | nvd-all.yaml |
| gpu-operator-nvidia-driver/SKILL.md | 209 | nvd-driver-multiple.yaml |
| gpu-operator-nvidia-driver/SKILL.md | 227 | nvd-precompiled-all.yaml |
| gpu-operator-nvidia-driver/SKILL.md | 254 | nvd-precomiled-some.yaml |
| gpu-operator-nvidia-google/SKILL.md | 69 | gpu-operator-quota.yaml |
| gpu-operator-nvidia-google/SKILL.md | 178 | gpu-operator-quota.yaml |
| gpu-operator-timeslicing-gpus/SKILL.md | 154, 187, 351 | three time-slicing configs |

Fix the converter to inline the contents of every .. literalinclude:: target as a fenced code block (with the right language tag from the original :language: option), then re-run.
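
For illustration, the fix described above could look roughly like the following. This is a minimal sketch, not the converter's actual code: the function name `inline_literalincludes` and the `base_dir` parameter are hypothetical, only the `:language:` option is honored (options such as `:lines:` are read but ignored), and a missing target raises instead of being dropped silently.

```python
import re
from pathlib import Path

FENCE = "`" * 3  # built at runtime so this sketch can live inside a fenced block
DIRECTIVE = re.compile(r"^(\s*)\.\. literalinclude::\s+(\S+)\s*$")
OPTION = re.compile(r"^\s+:([\w-]+):\s*(.*)$")


def inline_literalincludes(rst_text, base_dir):
    """Replace each literalinclude directive with a fenced block of the target file.

    Hypothetical helper: base_dir is the directory the include paths are
    relative to (assumed to be the directory of the source RST file).
    """
    lines = rst_text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        match = DIRECTIVE.match(lines[i])
        if not match:
            out.append(lines[i])
            i += 1
            continue
        indent = match.group(1)
        target = Path(base_dir) / match.group(2)
        i += 1
        # Collect the directive's option lines, for example ":language: yaml".
        options = {}
        while i < len(lines) and (opt := OPTION.match(lines[i])):
            options[opt.group(1)] = opt.group(2).strip()
            i += 1
        # Fail loudly: a silent drop is the bug this finding describes.
        if not target.is_file():
            raise FileNotFoundError(f"literalinclude target not found: {target}")
        out.append(indent + FENCE + options.get("language", ""))
        out.extend(
            indent + line if line else line
            for line in target.read_text(encoding="utf-8").splitlines()
        )
        out.append(indent + FENCE)
    return "\n".join(out)
```

Raising on a missing file means a misspelled asset name surfaces at conversion time rather than as an empty step in the generated skill.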


* You installed and initialized the Google Cloud CLI.

- name: RUNTIME_CONFIG_SOURCE

Collaborator

Critical: YAML fragment leaking out of code block — root_cause: escape-out-of-context-yaml

This line appears as bare text between the (lone) prerequisite at line 11 and the H1 at line 15:

- name: RUNTIME_CONFIG_SOURCE

It's a fragment from a YAML example that escaped its container during conversion. There's no surrounding code block, so it renders as a markdown list item with no context. Almost certainly the rest of the YAML (and the surrounding prose) was also lost.

Suggested change
- name: RUNTIME_CONFIG_SOURCE

Remove this stray line and check the source gpu-operator/google-gke.rst for the YAML block this fragment escaped from — odds are other content was lost in the same place.


The choice depends on the operating system and whether you prefer to have the Operator manage all the software components.

| Google Driver Installer - | Container-Optimized OS | Ubuntu with containerd | The Google driver installer manages the NVIDIA GPU Driver. NVIDIA GPU Operator manages other software components. |

Collaborator

Critical: mangled table — root_cause: garbled-table

The table at lines 25–27 has its 4 columns collapsed into one header row. The first row reads:

| Google Driver Installer - | Container-Optimized OS | Ubuntu with containerd | The Google driver installer manages the NVIDIA GPU Driver. NVIDIA GPU Operator manages other software components. |
| --- | --- | --- | --- |
| NVIDIA Driver Manager - | Ubuntu with containerd | NVIDIA GPU Operator manages the lifecycle and upgrades of the driver and other NVIDIA software. |  |

The original RST list-table had columns: Approach, Operating System(s), Description. The converter has merged the approach name and the first OS column into one cell, then put the description in column 4 — and on the second row it dropped one cell entirely so the row has the wrong column count.

Fix the converter's list-table handling and re-emit. As-is, this is unreadable.
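
One way to make this class of bug visible is to parse the list-table rows explicitly and refuse to emit a table whose rows disagree on column count. The sketch below assumes the directive body has already been isolated as a list of lines; `parse_list_table_rows` and `rows_to_markdown` are hypothetical names, and nested bullet lists inside a cell are not handled.

```python
import re

ROW_START = re.compile(r"^\s*\*\s+-\s?(.*)$")  # "* - first cell" opens a new row
CELL_START = re.compile(r"^\s+-\s?(.*)$")      # "  - next cell" continues the row


def parse_list_table_rows(body_lines):
    """Return the rows of an RST list-table body as lists of cell strings."""
    rows = []
    for line in body_lines:
        stripped = line.strip()
        if not stripped:
            continue
        if not rows and stripped.startswith(":"):
            continue  # directive options such as ":header-rows: 1"
        if (m := ROW_START.match(line)):
            rows.append([m.group(1).strip()])
        elif (m := CELL_START.match(line)) and rows:
            rows[-1].append(m.group(1).strip())
        elif rows:
            # A wrapped line belongs to the cell that is currently open.
            rows[-1][-1] = (rows[-1][-1] + " " + stripped).strip()
    return rows


def rows_to_markdown(rows):
    """Emit a pipe table, treating the first row as the header."""
    if not rows:
        raise ValueError("list-table has no rows")
    width = len(rows[0])
    for number, row in enumerate(rows):
        if len(row) != width:
            raise ValueError(f"row {number} has {len(row)} cells, expected {width}")

    def fmt(row):
        return "| " + " | ".join(cell.replace("|", "\\|") for cell in row) + " |"

    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(rows[0]), separator] + [fmt(row) for row in rows[1:]])
```

With the three columns named above (Approach, Operating System(s), Description), a row that loses or gains a cell raises a `ValueError` instead of producing a shifted table.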

## CVEs

The following is a list of known CVEs in the GPU Operator or its operands.
To view any published security bulletins for NVIDIA products published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/.

Collaborator

Medium: duplicated phrase

The sentence at line 26 reads:

To view any published security bulletins for NVIDIA products published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/.

Suggested change
To view any published security bulletins for NVIDIA products published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/.
To view any published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/.

The base images used by the software might include software that is licensed under open-source licenses such as GPL.
The source code for these components is archived on the CUDA opensource [index](https://developer.download.nvidia.com/compute/cuda/opensource/).

The following table identifieis the licenses for the Operator and software components.

Collaborator

Medium: typo identifieis → identifies

Suggested change
The following table identifieis the licenses for the Operator and software components.
The following table identifies the licenses for the Operator and software components.

@@ -0,0 +1,189 @@
---
name: "gpu-operator-container-device"
description: "Explains how to configure CDI and NRI support for GPU workloads. Use when enabling CDI, configuring containerd, or troubleshooting CDI-based GPU injection. Trigger keywords - NVIDIA GPU Operator, CDI, NRI, containerd, Kubernetes."

@chenopis chenopis May 20, 2026

Collaborator

Medium: Trigger keywords - … suffix in description — root_cause: description-trigger-suffix

The converter appends Trigger keywords - X, Y, Z. to the end of every skill's description field. This isn't part of the Agent Skills spec: the spec describes description as a focused, single-purpose summary capped at 1024 chars, with separate triggers and tags arrays for keyword-style routing.

The suffix bloats the description (e.g. gpu-operator-references/SKILL.md:3 runs to 17+ keywords), competes with the actual sentence-form description for the model's attention during routing, and duplicates information that should live under triggers: / tags:.

25 SKILLs affected (all 25 carry the suffix). Recommend dropping the Trigger keywords - … suffix from every description and instead emitting triggers: and tags: arrays in the frontmatter (the upstream RSTs already supply them via the :tags: and :keywords: fields in their .. meta:: blocks).
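
As a rough sketch of the recommended split, the suffix can be peeled off the existing description and turned into a list that the converter then writes under triggers:. The helper name `split_description` is hypothetical, and sourcing the keywords from the suffix (rather than from the .. meta:: block) is an assumption made only to keep the example self-contained.

```python
import re

# Matches the converter's "Trigger keywords - A, B, C." tail on a description.
SUFFIX = re.compile(r"\s*Trigger keywords - (?P<keywords>.+?)\.?\s*$")


def split_description(description):
    """Return (summary, triggers) with the keyword suffix removed from the summary."""
    match = SUFFIX.search(description)
    if not match:
        return description.strip(), []
    summary = description[: match.start()].strip()
    triggers = [k.strip() for k in match.group("keywords").split(",") if k.strip()]
    return summary, triggers


summary, triggers = split_description(
    "Explains how to configure CDI and NRI support for GPU workloads. "
    "Use when enabling CDI, configuring containerd, or troubleshooting "
    "CDI-based GPU injection. Trigger keywords - NVIDIA GPU Operator, CDI, "
    "NRI, containerd, Kubernetes."
)
```

The summary keeps only the sentence-form description, and the five keywords become separate list entries for the frontmatter.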


# NVIDIA GPU Operator with Amazon EKS

## Step 1: Approaches for Working with Amazon EKS

Collaborator

Medium: Step N: prefix on every H2 — root_cause: awkward-step-numbering

The converter prefixes every H2 heading with Step 1:, Step 2:, … regardless of whether the H2 is procedural. Here:

## Step 1: Approaches for Working with Amazon EKS

…the section discusses two alternative approaches; it isn't "Step 1" of anything. Same pattern appears as "Step 1: About Multi-Instance GPU", "Step 1: HTTP Proxy Configuration for Openshift", "Step 1: Special Considerations for Service Meshes", etc.

23 SKILLs affected. The Step N: prefix is also inconsistent with how procedural steps are actually written inside each H2 (numbered list items 1., 2., …). Recommend dropping the auto-Step N: prefix from H2s. If a SKILL.md genuinely needs ordered top-level phases, structure them as a numbered list rather than as numbered H2s.
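
Until the converter itself stops emitting the prefix, a post-processing pass along these lines would remove it. This is a sketch under the assumption that the prefix always has the literal form "Step N: " directly after the heading marker; `strip_step_prefixes` is a hypothetical name.

```python
import re

# H2/H3 headings only; numbered list items and body text are left alone.
STEP_PREFIX = re.compile(r"^(#{2,3}[ \t]+)Step \d+:[ \t]*", re.MULTILINE)


def strip_step_prefixes(markdown):
    """Drop the auto-generated 'Step N:' prefix from H2 and H3 headings."""
    return STEP_PREFIX.sub(r"\1", markdown)
```

Applied to the heading quoted above, this yields "## Approaches for Working with Amazon EKS" and leaves numbered list items inside the section unchanged.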

1) Container images need to be pulled during GPU Operator installation.
2) The `driver` container needs to download several OS packages prior to driver installation.

**Tip:**

@chenopis chenopis May 20, 2026

Collaborator

Medium: .. note:: / .. tip:: admonitions flattened to bold — root_cause: flat-bold-admonitions

The converter renders RST admonitions as a bare bold label followed by paragraph body, e.g.:

   **Tip:**

   Using precompiled-drivers removes the need for the `driver` containers to
   download operating system packages.

Nothing visually distinguishes this from any other paragraph; the admonition's callout semantics are lost. 21 of the SKILL/reference files are affected (68 occurrences; typically 2–10 admonitions per file).

Recommend converting RST admonitions to GitHub-flavored Markdown alerts (> [!NOTE], > [!TIP], > [!WARNING], > [!IMPORTANT], > [!CAUTION]):

Suggested change
**Tip:**
> [!TIP]
> Using precompiled-drivers removes the need for the `driver` containers to
> download operating system packages.

If this skill format is not intended to render in GitHub, pick whatever admonition syntax the target Skill viewer supports — but keep the admonition class (Note / Tip / Warning) machine-readable so downstream renderers can style them.
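
A sketch of that conversion, working from the RST source rather than from the already-flattened bold form, might look like this. It assumes the five admonition classes listed above and treats every following blank or indented line as the directive body; `convert_admonitions` is a hypothetical name, and nested directives inside an admonition are not handled.

```python
import re

ADMONITION = re.compile(r"^\.\. (note|tip|warning|important|caution)::\s*(.*)$")


def convert_admonitions(rst_text):
    """Rewrite top-level RST admonitions as GitHub-flavored Markdown alerts."""
    lines = rst_text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        match = ADMONITION.match(lines[i])
        if not match:
            out.append(lines[i])
            i += 1
            continue
        kind = match.group(1).upper()
        body = [match.group(2)] if match.group(2) else []
        i += 1
        # The body runs until the first non-blank line that is not indented.
        while i < len(lines) and (not lines[i].strip() or lines[i][0] in " \t"):
            body.append(lines[i].strip())
            i += 1
        while body and not body[0]:
            body.pop(0)
        had_trailing_blank = False
        while body and not body[-1]:
            body.pop()
            had_trailing_blank = True
        out.append(f"> [!{kind}]")
        out.extend(f"> {text}" if text else ">" for text in body)
        if had_trailing_blank:
            out.append("")  # keep the blank line that separated the next paragraph
    return "\n".join(out)
```

Because the body boundary is taken from indentation, the paragraph that follows an admonition stays outside the alert.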

@a-mccarthy

Collaborator

@miyoungc thanks for opening this. I'm curious about maintaining these going forward. And it looks like all new pages need the metadata now?

@miyoungc

Collaborator Author

@a-mccarthy
Hi Abi, yes you need the metadata going forward. And generating the skills is a simple command run and takes <1s. If docs are maintained for a release, skills get refreshed together when you are done with documentation of the release.

@chenopis

chenopis commented Jun 1, 2026

Collaborator

Review follow-up — count corrections + 2 systemic findings the first pass missed

Re-validated all 15 findings against the branch (all 15 reproduce at the cited lines). Three count corrections + two additions:

Corrections (already applied inline):

2 systemic findings missed in the first pass (same converter-class root cause):

  • Missing Prerequisites — ~18 of 25 skills have none (only 7 do). For executable skills, agents proceed without checking preconditions.
  • Missing verification/validation step — ~19 of 25 skills have none (only 6 do). No "how does the user confirm success?" — material for runnable skills.

Bonus catches: misspelled asset nvd-precomiled-some.yaml (gpu-operator-nvidia-driver); an unrendered ${version} template token (gpu-operator-install-ing-nvidia).
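
The two systemic findings above lend themselves to a cheap automated check. The following sketch flags a SKILL.md that has no heading mentioning prerequisites or verification; the heading patterns are assumptions about how such sections would be titled, `missing_sections` is a hypothetical name, and `#` comment lines inside fenced code blocks would be miscounted as headings.

```python
import re

HEADING = re.compile(r"^#{1,6}[ \t]+(.*)$", re.MULTILINE)

# Assumed heading wording; adjust to whatever titles the skills actually use.
REQUIRED = {
    "prerequisites": re.compile(r"prerequisite", re.IGNORECASE),
    "verification": re.compile(r"verif|validat", re.IGNORECASE),
}


def missing_sections(markdown):
    """Return the names of required sections that no heading in the text covers."""
    headings = HEADING.findall(markdown)
    return [
        name
        for name, pattern in REQUIRED.items()
        if not any(pattern.search(heading) for heading in headings)
    ]
```

Running this over each skill directory, for example with `pathlib.Path.rglob("SKILL.md")`, would give a per-skill list to work through in the follow-up commits.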

@chenopis chenopis requested a review from a-mccarthy June 1, 2026 10:34
@chenopis chenopis added the documentation (Issue/PR focused on fixing/editing/adding documentation bits) and skills (AI agent skills) labels Jun 1, 2026
chenopis added a commit that referenced this pull request Jun 1, 2026
Applies verified #401 review findings and the doc-optimize-skill quality
pass to the two highest-blast-radius skills.

gpu-operator-install (renamed from the converter typo
gpu-operator-install-ing-nvidia, finding #6/#1):
- Restored the 4 silently-dropped prerequisites from getting-started.rst
  (ClusterPolicy/OS constraints, container engine, PSA labeling, NFD
  detection) — finding #1.
- Inlined the dropped tf-notebook.yaml literalinclude — finding #2.
- Fixed 5 broken :external+ Sphinx roles to real OpenShift/CTK/Edge URLs
  — finding #7.
- Mapped lost :ref:/:doc: cross-refs to (use the X skill) or published
  doc links — finding #10.
- Fixed nvaie-tanzu_ RST link leak in the platforms table.
- Dropped the Trigger keywords suffix; added triggers:/tags: arrays
  — finding #13.
- Removed Step N: H2 prefixes — finding #14.
- Converted flat-bold admonitions to GitHub alerts — finding #15.
- Replaced 11 ${version} leaks with v26.3.1.

gpu-operator-references:
- Rewrote the mis-routed description (claimed confidential-containers
  only; loads 7 references) — finding #5; dropped Trigger suffix, added
  triggers:/tags:.
- overview.md: fixed missing image asset (#9), 16 pstai_ link leaks (#8),
  lost cross-refs (#10), :external+ocpindex role (#7), identifieis typo
  (#12).
- security.md: removed duplicated phrase (#11).
- life-cycle-policy.md / platform-support.md / release-notes.md:
  converted flat-bold admonitions to GitHub alerts (#15) and repaired
  admonition-boundary over-captures + an ocp_csp_support substitution
  leak surfaced during conversion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
chenopis added a commit that referenced this pull request Jun 1, 2026
Applies the verified #401 systemic findings to all remaining skills:
- Dropped Trigger keywords suffix; added triggers:/tags: frontmatter
  arrays sourced from each page's .. meta:: block (#13).
- Removed Step N: H2/H3 prefixes (#14).
- Converted flat-bold admonitions to GitHub alerts (#15).
- Replaced ${version} leaks with v26.3.1.
- Fixed 4 broken :external+ Sphinx roles to real OpenShift doc URLs (#7).
- Restored 12 silently dropped .. literalinclude:: code blocks from the
  source manifests across nvidia-driver (4), timeslicing (3),
  gpudirect-rdma (2), google (2), multiinstance (1), amazon (1) (#2).
- Fixed misspelled asset nvd-precomiled-some.yaml -> nvd-precompiled-some.yaml.

Prerequisites/Verification sections and remaining cross-ref repairs land
in the following per-skill commits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
chenopis added a commit that referenced this pull request Jun 1, 2026
… placeholder

A prior optimization pass replaced the source ${version} Sphinx
substitution with a frozen patch version (v26.3.1) across the skills,
which freezes a specific release and goes stale. The original #401
finding was a ${version} leak (raw template var in rendered output),
so the fix is a non-leaking, non-frozen value.

Replace every command/URL/image-tag occurrence of v26.3.1 introduced on
this branch with the angle-bracket placeholder <gpu-operator-version>
(matching the project's existing <supported-version> / <driver-branch>
convention), and add a brief inline note on first use per file pointing
to the GPU Operator releases page.

The version-specific factual reference data (release-notes.md changelog
heading, life-cycle-policy.md version table) is left intact — that
content genuinely describes a specific historical release and was not
introduced by the optimization pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>