Companion to #401: verified review fixes + optimization pass on the 25 GPU Operator skills#422
Open
chenopis wants to merge 13 commits into
Open
Conversation
c105d7a to
e849b4d
Compare
Applies verified #401 review findings and the doc-optimize-skill quality pass to the two highest-blast-radius skills. gpu-operator-install (renamed from the converter typo gpu-operator-install-ing-nvidia, finding #6/#1): - Restored the 4 silently-dropped prerequisites from getting-started.rst (ClusterPolicy/OS constraints, container engine, PSA labeling, NFD detection) — finding #1. - Inlined the dropped tf-notebook.yaml literalinclude — finding #2. - Fixed 5 broken :external+ Sphinx roles to real OpenShift/CTK/Edge URLs — finding #7. - Mapped lost :ref:/:doc: cross-refs to (use the X skill) or published doc links — finding #10. - Fixed nvaie-tanzu_ RST link leak in the platforms table. - Dropped the Trigger keywords suffix; added triggers:/tags: arrays — finding #13. - Removed Step N: H2 prefixes — finding #14. - Converted flat-bold admonitions to GitHub alerts — finding #15. - Replaced 11 ${version} leaks with v26.3.1. gpu-operator-references: - Rewrote the mis-routed description (claimed confidential-containers only; loads 7 references) — finding #5; dropped Trigger suffix, added triggers:/tags:. - overview.md: fixed missing image asset (#9), 16 pstai_ link leaks (#8), lost cross-refs (#10), :external+ocpindex role (#7), identifieis typo (#12). - security.md: removed duplicated phrase (#11). - life-cycle-policy.md / platform-support.md / release-notes.md: converted flat-bold admonitions to GitHub alerts (#15) and repaired admonition-boundary over-captures + an ocp_csp_support substitution leak surfaced during conversion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>
Applies the verified #401 systemic findings to all remaining skills: - Dropped Trigger keywords suffix; added triggers:/tags: frontmatter arrays sourced from each page's .. meta:: block (#13). - Removed Step N: H2/H3 prefixes (#14). - Converted flat-bold admonitions to GitHub alerts (#15). - Replaced ${version} leaks with v26.3.1. - Fixed 4 broken :external+ Sphinx roles to real OpenShift doc URLs (#7). - Restored 12 silently dropped .. literalinclude:: code blocks from the source manifests across nvidia-driver (4), timeslicing (3), gpudirect-rdma (2), google (2), multiinstance (1), amazon (1) (#2). - Fixed misspelled asset nvd-precomiled-some.yaml -> nvd-precompiled-some.yaml. Prerequisites/Verification sections and remaining cross-ref repairs land in the following per-skill commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>
…refs - Fixed 11 cases where a numbered procedure step was swept into a GitHub alert because the flattened RST source lost the blank-line boundary (container-device, kata-containers, multiinstance, amazon, nvidia-driver). - Mapped remaining bare-text :ref: cross-refs in vgpu, multiinstance, and government-ready SKILLs to published doc links (#10). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>
Adds systemic Prerequisites and/or Verification sections to container-device, custom-driver, driver-upgrades, nvidia-driver, nvidia-azure, and service-mesh skills (#systemic). Fixed a bare getting-started cross-ref in service-mesh. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>
Adds Prerequisites to outdated-kernels, airgapped, uninstalling, gpudirect-rdma, multiinstance, and timeslicing skills. Repaired the uninstalling NOTE admonition over-capture (two-bullet note + code block) and mapped timeslicing in-page section cross-refs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>
- Restored title H1s that the converter replaced with a bare # Prerequisites heading, moving prerequisites to a ## section after the real title (http-proxy, vgpu, kubevirt, dra, google, upgrading). - Restored dropped prerequisite list bodies from source RST (vgpu, kubevirt, google, upgrading). - Fixed the google mangled approaches table (finding #4) and stray '- name: RUNTIME_CONFIG_SOURCE' YAML fragment leak (finding #3). - Fixed http-proxy admonition over-capture + proxy_config_openshift ref. - Added Prerequisites to gov-ready, kata, precompiled, amazon, nvaie. - Added Verification sections to nvaie, vgpu, amazon, google, upgrading, service-mesh, nvidia-driver, container-device. - Mapped remaining bare cross-refs (nvaie component-matrix/install, google/upgrading skill refs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>
… placeholder
A prior optimization pass replaced the source ${version} Sphinx
substitution with a frozen patch version (v26.3.1) across the skills,
which freezes a specific release and goes stale. The original #401
finding was a ${version} leak (raw template var in rendered output),
so the fix is a non-leaking, non-frozen value.
Replace every command/URL/image-tag occurrence of v26.3.1 introduced on
this branch with the angle-bracket placeholder <gpu-operator-version>
(matching the project's existing <supported-version> / <driver-branch>
convention), and add a brief inline note on first use per file pointing
to the GPU Operator releases page.
The version-specific factual reference data (release-notes.md changelog
heading, life-cycle-policy.md version table) is left intact — that
content genuinely describes a specific historical release and was not
introduced by the optimization pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
…ormation-hiding dispatch layout Convert the 7 largest procedural skills (450-583 lines each) into a thin dispatch-layer SKILL.md (each <200 lines, all under ~55 non-frontmatter lines) plus phase-specific references/*.md files. Step-by-step command sequences, manifests, field tables, and verification output move out of SKILL.md into references/, so the dispatch layer cannot leak later-phase procedural detail (structural no-skip property; no tooling dependency). All verified-fix content from the prior optimization pass (prerequisites, verification sections, fixed cross-refs, <gpu-operator-version> placeholders) is preserved and relocated, not removed. Skills restructured (before -> after SKILL.md raw line count): - gpu-operator-kata-containers 583 -> 65 - gpu-operator-install 572 -> 71 - gpu-operator-multiinstance 560 -> 62 - gpu-operator-gpudirect-rdma 555 -> 67 - gpu-operator-kubevirt 515 -> 68 - gpu-operator-timeslicing-gpus 492 -> 62 - gpu-operator-nvidia-driver 451 -> 60 Each passes the close-your-eyes dispatch-layer test (leaked cmds, bash blocks, and line count all within thresholds). Internal SKILL.md -> references/*.md links verified to resolve. The remaining 17 procedural skills plus the reference skill are deferred to a later batch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>
e849b4d to
1271664
Compare
… skills) Restructure 5 procedural GPU Operator skills to the dispatch-layer information-hiding pattern (matching the prior 7-skill pass): thin SKILL.md (frontmatter + intro + Prerequisites + Activation + Phases table + cross-cutting hard rules + Verification, all <200 lines) with every step-by-step command sequence, manifest, and field detail moved into phase-specific references/*.md. Skills: custom-driver, install-service-mesh, uninstalling-nvidia, install-http-proxy, install-outdated-kernels. All five pass the close-your-eyes test (leaked cmds, bash blocks, and line count within thresholds). Verified-fix content (prerequisites, verification, <gpu-operator-version> placeholders, fixed cross-refs) preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>
…ready) Restructure 3 procedural GPU Operator skills to the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines) with all step-by-step detail moved into phase-specific references/*.md. Skills: nvidia-azure (AKS approaches + preinstalled-driver install), install-nvidia-enterprise (concepts, vGPU-driver, NLS-token-update, data-center-driver), install-governmentready-environments (overview, install, Ubuntu-Pro-token-update). All three pass the close-your-eyes test. Verified-fix content (prerequisites, verification, <gpu-operator-version> placeholders, cross-refs) preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>
…EKS, GKE) Restructure 3 procedural GPU Operator skills to the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines) with all step-by-step detail moved into phase-specific references/*.md. Skills: container-device (CDI, NRI, verification), nvidia-amazon (approaches, eksctl-example), nvidia-google (approaches, google-driver-installer, nvidia-driver-manager). All three pass the close-your-eyes test. Verified-fix content preserved (<gpu-operator-version> placeholders, cross-refs to gpu-operator-install/-references skills, verification sections). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>
…pgrading-nvidia) Restructure 2 procedural GPU Operator skills to the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines) with all step-by-step detail moved into phase-specific references/*.md. Skills: driver-upgrades (concepts, upgrade-controller, without-controller), upgrading-nvidia (helm-upgrade, other-updates). While moving content, repaired pre-existing un-fenced bash blocks in the driver-upgrades troubleshooting steps (wrapped bare '$ kubectl ...' lines in console code fences). All other content preserved verbatim, including the upgrade-controller state-machine graphics reference and <gpu-operator-version> placeholders. Both pass the close-your-eyes test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>
… airgapped, dra) Restructure the 4 remaining (largest) procedural GPU Operator skills to the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines) with all step-by-step detail moved into phase-specific references/*.md. Skills: precompiled-drivers (concepts, availability, enable-disable, build-custom), install-nvidia-vgpu (concepts, build-driver, configure-and-install), install-airgapped-environments (concepts, local-image-registry, local-package-repository, deploy), nvidia-dra (concepts, install-gpu-operator, install-dra-driver, health-checks). All four pass the close-your-eyes test. Verified-fix content preserved (prerequisites, verification, <gpu-operator-version> placeholders, EULA/license warnings, cross-refs to install/precompiled skills). This completes the 17-skill remaining set; with the prior 7 done, all 24 procedural gpu-operator skills are now information-hidden. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Andrew Chen <andrewch@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Companion to #401 — verified review fixes + optimization pass
A companion branch to @miyoungc's #401 (GPU Operator docs → Agent Skills conversion), for side-by-side comparison: it takes the skills from #401 and applies (a) the verified review feedback posted on #401, (b) two systemic additions, and (c) an information-hiding optimization pass. The Skill-format direction is Miyoung's; this branch repairs converter-class defects and raises per-skill quality.
1. Verified review findings applied (from #401's review)
Critical — skill-name typo (
install-ing-nvidia→gpu-operator-install, #6/#1); restored 4 dropped install prerequisites (#1); inlined 12 dropped.. literalinclude::manifests (#2); removed a leaked YAML fragment + restored prereqs in the Google skill (#3); rebuilt the mangled Google table (#4); rewrote the references-skill description to cover all 7 references (#5).High — mapped all 9 broken
:external+…:doc:roles to real URLs (#7); replaced 16pstai_+ thenvaie-tanzu_link-target leaks (#8); fixed the missing overview image (#9); mapped lost:ref:/:doc:cross-refs to(use the X skill)/ doc links (#10).Medium — duplicated phrase (#11),
identifieistypo (#12); moved non-specTrigger keywordsinto propertriggers:/tags:frontmatter — all 25 (#13); removed autoStep N:heading prefixes — 23 (#14); converted flat-bold admonitions to GitHub alerts (#15).Bonus — fixed the
nvd-precomiled-some.yamltypo; replaced unrendered${version}template tokens with a documented<gpu-operator-version>placeholder (+ a "substitute your target release" note) rather than a brittle hardcoded version.2. Systemic additions (the two highest-value gaps)
## Prerequisitesto every skill that lacked one (~17/25 did), sourced from the RST where present.## Verification("how do you confirm success?") to skills lacking one (a missing verification step is high-severity for a runnable skill); existing inline checks preserved.3. Information-hiding optimization (the headline change)
All 24 procedural skills were restructured from flat docs (hundreds of lines, every step visible up front) into a thin dispatch layer + on-demand references, following the
doc-optimize-skillplan-init pattern:SKILL.md(now 40–76 lines each, all <200): frontmatter + an## Activation"read the phase reference first" note + a## Phasestable with one-line summaries pointing toreferences/<phase>.md+ cross-cutting rules + Prerequisites + Verification.references/*.md(73 reference files across the 24 skills) — loaded on demand, not up front.description+triggers:/tags:. NVIDIA's published-doc voice preserved.The 25th skill (
gpu-operator-references) is a pure index/pointer skill and already used areferences/layout.Scope & method
SKILL.mdfiles restructured + 73references/*.mdfiles created; all command blocks/manifests/tables preserved verbatim except where fixing a flagged defect.${version}/:external+/Step N:/Trigger keywords/flat-bold-admonition occurrences; every frontmatter parses as valid YAML with a single H1; all(use the X skill)cross-refs andSKILL.md → references/*.mdlinks resolve.Flagged rather than guessed
:external+ctkconfigurationmapped to the Container Toolkit install-guide page (adjust if a dedicated config page exists)._images/URL rather than committing the binary.references/release-notes.mdchangelog left as prose (low value, high volume); user-facing SKILL.md cross-refs all repaired.🤖 Generated with Claude Code