Companion to #401: verified review fixes + optimization pass on the 25 GPU Operator skills by chenopis · Pull Request #422 · NVIDIA/cloud-native-docs

chenopis · 2026-06-01T11:01:16Z

Companion to #401 — verified review fixes + optimization pass

A companion branch to @miyoungc's #401 (GPU Operator docs → Agent Skills conversion), for side-by-side comparison: it takes the skills from #401 and applies (a) the verified review feedback posted on #401, (b) two systemic additions, and (c) an information-hiding optimization pass. The Skill-format direction is Miyoung's; this branch repairs converter-class defects and raises per-skill quality.

How to read the diff: it's large (104 files) mostly because of (c) — each procedural skill was split from one flat SKILL.md into a thin dispatch SKILL.md + a references/ directory. Compare original (flat) ↔ optimized (thin + references) per skill.

1. Verified review findings applied (from #401's review)

Critical — skill-name typo (install-ing-nvidia→gpu-operator-install, #6/#1); restored 4 dropped install prerequisites (#1); inlined 12 dropped .. literalinclude:: manifests (#2); removed a leaked YAML fragment + restored prereqs in the Google skill (#3); rebuilt the mangled Google table (#4); rewrote the references-skill description to cover all 7 references (#5).
High — mapped all 9 broken :external+…:doc: roles to real URLs (#7); replaced 16 pstai_ + the nvaie-tanzu_ link-target leaks (#8); fixed the missing overview image (#9); mapped lost :ref:/:doc: cross-refs to (use the X skill) / doc links (#10).
Medium — duplicated phrase (#11), identifieis typo (#12); moved non-spec Trigger keywords into proper triggers:/tags: frontmatter — all 25 (#13); removed auto Step N: heading prefixes — 23 (#14); converted flat-bold admonitions to GitHub alerts (#15).
Bonus — fixed the nvd-precomiled-some.yaml typo; replaced unrendered ${version} template tokens with a documented <gpu-operator-version> placeholder (+ a "substitute your target release" note) rather than a brittle hardcoded version.

2. Systemic additions (the two highest-value gaps)

Prerequisites — added ## Prerequisites to every skill that lacked one (~17/25 did), sourced from the RST where present.
Verification — added ## Verification ("how do you confirm success?") to skills lacking one (a missing verification step is high-severity for a runnable skill); existing inline checks preserved.

3. Information-hiding optimization (the headline change)

All 24 procedural skills were restructured from flat docs (hundreds of lines, every step visible up front) into a thin dispatch layer + on-demand references, following the doc-optimize-skill plan-init pattern:

Thin SKILL.md (now 40–76 lines each, all <200): frontmatter + an ## Activation "read the phase reference first" note + a ## Phases table with one-line summaries pointing to references/<phase>.md + cross-cutting rules + Prerequisites + Verification.
All step-by-step detail (command sequences, manifests, field tables, decision trees) moved into references/*.md (73 reference files across the 24 skills) — loaded on demand, not up front.
Why it matters: the agent loads a small dispatch layer and can't "skip ahead" past steps (verified by the close-eyes test); and the always-loaded context footprint drops ~4× (lines) / ~3× (tokens) vs the flat versions — which matters when the agent's context is already busy.
Plus light hygiene: restored title H1s the converter had replaced, repaired admonition-boundary over-captures, unambiguous routing via cleaned description + triggers:/tags:. NVIDIA's published-doc voice preserved.

The 25th skill (gpu-operator-references) is a pure index/pointer skill and already used a references/ layout.

Scope & method

25 SKILL.md files restructured + 73 references/*.md files created; all command blocks/manifests/tables preserved verbatim except where fixing a flagged defect.
Verified: all 24 procedural skills pass the close-eyes test; zero residual ${version}/:external+/Step N:/Trigger keywords/flat-bold-admonition occurrences; every frontmatter parses as valid YAML with a single H1; all (use the X skill) cross-refs and SKILL.md → references/*.md links resolve.

Flagged rather than guessed

:external+ctkconfiguration mapped to the Container Toolkit install-guide page (adjust if a dedicated config page exists).
Overview image points at the published _images/ URL rather than committing the binary.
Bare cross-refs inside the large historical references/release-notes.md changelog left as prose (low value, high volume); user-facing SKILL.md cross-refs all repaired.

🤖 Generated with Claude Code

Applies verified #401 review findings and the doc-optimize-skill quality pass to the two highest-blast-radius skills. gpu-operator-install (renamed from the converter typo gpu-operator-install-ing-nvidia, finding #6/#1): - Restored the 4 silently-dropped prerequisites from getting-started.rst (ClusterPolicy/OS constraints, container engine, PSA labeling, NFD detection) — finding #1. - Inlined the dropped tf-notebook.yaml literalinclude — finding #2. - Fixed 5 broken :external+ Sphinx roles to real OpenShift/CTK/Edge URLs — finding #7. - Mapped lost :ref:/:doc: cross-refs to (use the X skill) or published doc links — finding #10. - Fixed nvaie-tanzu_ RST link leak in the platforms table. - Dropped the Trigger keywords suffix; added triggers:/tags: arrays — finding #13. - Removed Step N: H2 prefixes — finding #14. - Converted flat-bold admonitions to GitHub alerts — finding #15. - Replaced 11 ${version} leaks with v26.3.1. gpu-operator-references: - Rewrote the mis-routed description (claimed confidential-containers only; loads 7 references) — finding #5; dropped Trigger suffix, added triggers:/tags:. - overview.md: fixed missing image asset (#9), 16 pstai_ link leaks (#8), lost cross-refs (#10), :external+ocpindex role (#7), identifieis typo (#12). - security.md: removed duplicated phrase (#11). - life-cycle-policy.md / platform-support.md / release-notes.md: converted flat-bold admonitions to GitHub alerts (#15) and repaired admonition-boundary over-captures + an ocp_csp_support substitution leak surfaced during conversion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

Applies the verified #401 systemic findings to all remaining skills: - Dropped Trigger keywords suffix; added triggers:/tags: frontmatter arrays sourced from each page's .. meta:: block (#13). - Removed Step N: H2/H3 prefixes (#14). - Converted flat-bold admonitions to GitHub alerts (#15). - Replaced ${version} leaks with v26.3.1. - Fixed 4 broken :external+ Sphinx roles to real OpenShift doc URLs (#7). - Restored 12 silently dropped .. literalinclude:: code blocks from the source manifests across nvidia-driver (4), timeslicing (3), gpudirect-rdma (2), google (2), multiinstance (1), amazon (1) (#2). - Fixed misspelled asset nvd-precomiled-some.yaml -> nvd-precompiled-some.yaml. Prerequisites/Verification sections and remaining cross-ref repairs land in the following per-skill commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

…refs - Fixed 11 cases where a numbered procedure step was swept into a GitHub alert because the flattened RST source lost the blank-line boundary (container-device, kata-containers, multiinstance, amazon, nvidia-driver). - Mapped remaining bare-text :ref: cross-refs in vgpu, multiinstance, and government-ready SKILLs to published doc links (#10). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

Adds systemic Prerequisites and/or Verification sections to container-device, custom-driver, driver-upgrades, nvidia-driver, nvidia-azure, and service-mesh skills (#systemic). Fixed a bare getting-started cross-ref in service-mesh. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

Adds Prerequisites to outdated-kernels, airgapped, uninstalling, gpudirect-rdma, multiinstance, and timeslicing skills. Repaired the uninstalling NOTE admonition over-capture (two-bullet note + code block) and mapped timeslicing in-page section cross-refs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

- Restored title H1s that the converter replaced with a bare # Prerequisites heading, moving prerequisites to a ## section after the real title (http-proxy, vgpu, kubevirt, dra, google, upgrading). - Restored dropped prerequisite list bodies from source RST (vgpu, kubevirt, google, upgrading). - Fixed the google mangled approaches table (finding #4) and stray '- name: RUNTIME_CONFIG_SOURCE' YAML fragment leak (finding #3). - Fixed http-proxy admonition over-capture + proxy_config_openshift ref. - Added Prerequisites to gov-ready, kata, precompiled, amazon, nvaie. - Added Verification sections to nvaie, vgpu, amazon, google, upgrading, service-mesh, nvidia-driver, container-device. - Mapped remaining bare cross-refs (nvaie component-matrix/install, google/upgrading skill refs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

… placeholder A prior optimization pass replaced the source ${version} Sphinx substitution with a frozen patch version (v26.3.1) across the skills, which freezes a specific release and goes stale. The original #401 finding was a ${version} leak (raw template var in rendered output), so the fix is a non-leaking, non-frozen value. Replace every command/URL/image-tag occurrence of v26.3.1 introduced on this branch with the angle-bracket placeholder <gpu-operator-version> (matching the project's existing <supported-version> / <driver-branch> convention), and add a brief inline note on first use per file pointing to the GPU Operator releases page. The version-specific factual reference data (release-notes.md changelog heading, life-cycle-policy.md version table) is left intact — that content genuinely describes a specific historical release and was not introduced by the optimization pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

…ormation-hiding dispatch layout Convert the 7 largest procedural skills (450-583 lines each) into a thin dispatch-layer SKILL.md (each <200 lines, all under ~55 non-frontmatter lines) plus phase-specific references/*.md files. Step-by-step command sequences, manifests, field tables, and verification output move out of SKILL.md into references/, so the dispatch layer cannot leak later-phase procedural detail (structural no-skip property; no tooling dependency). All verified-fix content from the prior optimization pass (prerequisites, verification sections, fixed cross-refs, <gpu-operator-version> placeholders) is preserved and relocated, not removed. Skills restructured (before -> after SKILL.md raw line count): - gpu-operator-kata-containers 583 -> 65 - gpu-operator-install 572 -> 71 - gpu-operator-multiinstance 560 -> 62 - gpu-operator-gpudirect-rdma 555 -> 67 - gpu-operator-kubevirt 515 -> 68 - gpu-operator-timeslicing-gpus 492 -> 62 - gpu-operator-nvidia-driver 451 -> 60 Each passes the close-your-eyes dispatch-layer test (leaked cmds, bash blocks, and line count all within thresholds). Internal SKILL.md -> references/*.md links verified to resolve. The remaining 17 procedural skills plus the reference skill are deferred to a later batch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

… skills) Restructure 5 procedural GPU Operator skills to the dispatch-layer information-hiding pattern (matching the prior 7-skill pass): thin SKILL.md (frontmatter + intro + Prerequisites + Activation + Phases table + cross-cutting hard rules + Verification, all <200 lines) with every step-by-step command sequence, manifest, and field detail moved into phase-specific references/*.md. Skills: custom-driver, install-service-mesh, uninstalling-nvidia, install-http-proxy, install-outdated-kernels. All five pass the close-your-eyes test (leaked cmds, bash blocks, and line count within thresholds). Verified-fix content (prerequisites, verification, <gpu-operator-version> placeholders, fixed cross-refs) preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

…ready) Restructure 3 procedural GPU Operator skills to the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines) with all step-by-step detail moved into phase-specific references/*.md. Skills: nvidia-azure (AKS approaches + preinstalled-driver install), install-nvidia-enterprise (concepts, vGPU-driver, NLS-token-update, data-center-driver), install-governmentready-environments (overview, install, Ubuntu-Pro-token-update). All three pass the close-your-eyes test. Verified-fix content (prerequisites, verification, <gpu-operator-version> placeholders, cross-refs) preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

…EKS, GKE) Restructure 3 procedural GPU Operator skills to the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines) with all step-by-step detail moved into phase-specific references/*.md. Skills: container-device (CDI, NRI, verification), nvidia-amazon (approaches, eksctl-example), nvidia-google (approaches, google-driver-installer, nvidia-driver-manager). All three pass the close-your-eyes test. Verified-fix content preserved (<gpu-operator-version> placeholders, cross-refs to gpu-operator-install/-references skills, verification sections). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

…pgrading-nvidia) Restructure 2 procedural GPU Operator skills to the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines) with all step-by-step detail moved into phase-specific references/*.md. Skills: driver-upgrades (concepts, upgrade-controller, without-controller), upgrading-nvidia (helm-upgrade, other-updates). While moving content, repaired pre-existing un-fenced bash blocks in the driver-upgrades troubleshooting steps (wrapped bare '$ kubectl ...' lines in console code fences). All other content preserved verbatim, including the upgrade-controller state-machine graphics reference and <gpu-operator-version> placeholders. Both pass the close-your-eyes test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

… airgapped, dra) Restructure the 4 remaining (largest) procedural GPU Operator skills to the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines) with all step-by-step detail moved into phase-specific references/*.md. Skills: precompiled-drivers (concepts, availability, enable-disable, build-custom), install-nvidia-vgpu (concepts, build-driver, configure-and-install), install-airgapped-environments (concepts, local-image-registry, local-package-repository, deploy), nvidia-dra (concepts, install-gpu-operator, install-dra-driver, health-checks). All four pass the close-your-eyes test. Verified-fix content preserved (prerequisites, verification, <gpu-operator-version> placeholders, EULA/license warnings, cross-refs to install/precompiled skills). This completes the 17-skill remaining set; with the prior 7 done, all 24 procedural gpu-operator skills are now information-hidden. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>

chenopis requested review from a-mccarthy and miyoungc June 1, 2026 11:01

chenopis self-assigned this Jun 1, 2026

chenopis added documentation Issue/PR focused on fixing/editing/adding documentation bits skills AI agent skills labels Jun 1, 2026

chenopis force-pushed the chenopis/gpu-operator-skills-optimized branch from c105d7a to e849b4d Compare June 1, 2026 11:13

chenopis and others added 8 commits June 1, 2026 14:44

chenopis force-pushed the chenopis/gpu-operator-skills-optimized branch from e849b4d to 1271664 Compare June 1, 2026 21:44

chenopis and others added 5 commits June 1, 2026 21:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Companion to #401: verified review fixes + optimization pass on the 25 GPU Operator skills#422

Companion to #401: verified review fixes + optimization pass on the 25 GPU Operator skills#422
chenopis wants to merge 13 commits into
gpu-operator-docs-to-skillsfrom
chenopis/gpu-operator-skills-optimized

chenopis commented Jun 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

chenopis commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!