Skip to content

Companion to #401: verified review fixes + optimization pass on the 25 GPU Operator skills#422

Open
chenopis wants to merge 13 commits into
gpu-operator-docs-to-skillsfrom
chenopis/gpu-operator-skills-optimized
Open

Companion to #401: verified review fixes + optimization pass on the 25 GPU Operator skills#422
chenopis wants to merge 13 commits into
gpu-operator-docs-to-skillsfrom
chenopis/gpu-operator-skills-optimized

Conversation

@chenopis

@chenopis chenopis commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Companion to #401 — verified review fixes + optimization pass

A companion branch to @miyoungc's #401 (GPU Operator docs → Agent Skills conversion), for side-by-side comparison: it takes the skills from #401 and applies (a) the verified review feedback posted on #401, (b) two systemic additions, and (c) an information-hiding optimization pass. The Skill-format direction is Miyoung's; this branch repairs converter-class defects and raises per-skill quality.

How to read the diff: it's large (104 files) mostly because of (c) — each procedural skill was split from one flat SKILL.md into a thin dispatch SKILL.md + a references/ directory. Compare original (flat) ↔ optimized (thin + references) per skill.

1. Verified review findings applied (from #401's review)

Critical — skill-name typo (install-ing-nvidiagpu-operator-install, #6/#1); restored 4 dropped install prerequisites (#1); inlined 12 dropped .. literalinclude:: manifests (#2); removed a leaked YAML fragment + restored prereqs in the Google skill (#3); rebuilt the mangled Google table (#4); rewrote the references-skill description to cover all 7 references (#5).
High — mapped all 9 broken :external+…:doc: roles to real URLs (#7); replaced 16 pstai_ + the nvaie-tanzu_ link-target leaks (#8); fixed the missing overview image (#9); mapped lost :ref:/:doc: cross-refs to (use the X skill) / doc links (#10).
Medium — duplicated phrase (#11), identifieis typo (#12); moved non-spec Trigger keywords into proper triggers:/tags: frontmatter — all 25 (#13); removed auto Step N: heading prefixes — 23 (#14); converted flat-bold admonitions to GitHub alerts (#15).
Bonus — fixed the nvd-precomiled-some.yaml typo; replaced unrendered ${version} template tokens with a documented <gpu-operator-version> placeholder (+ a "substitute your target release" note) rather than a brittle hardcoded version.

2. Systemic additions (the two highest-value gaps)

  • Prerequisites — added ## Prerequisites to every skill that lacked one (~17/25 did), sourced from the RST where present.
  • Verification — added ## Verification ("how do you confirm success?") to skills lacking one (a missing verification step is high-severity for a runnable skill); existing inline checks preserved.

3. Information-hiding optimization (the headline change)

All 24 procedural skills were restructured from flat docs (hundreds of lines, every step visible up front) into a thin dispatch layer + on-demand references, following the doc-optimize-skill plan-init pattern:

  • Thin SKILL.md (now 40–76 lines each, all <200): frontmatter + an ## Activation "read the phase reference first" note + a ## Phases table with one-line summaries pointing to references/<phase>.md + cross-cutting rules + Prerequisites + Verification.
  • All step-by-step detail (command sequences, manifests, field tables, decision trees) moved into references/*.md (73 reference files across the 24 skills) — loaded on demand, not up front.
  • Why it matters: the agent loads a small dispatch layer and can't "skip ahead" past steps (verified by the close-eyes test); and the always-loaded context footprint drops ~4× (lines) / ~3× (tokens) vs the flat versions — which matters when the agent's context is already busy.
  • Plus light hygiene: restored title H1s the converter had replaced, repaired admonition-boundary over-captures, unambiguous routing via cleaned description + triggers:/tags:. NVIDIA's published-doc voice preserved.

The 25th skill (gpu-operator-references) is a pure index/pointer skill and already used a references/ layout.

Scope & method

  • 25 SKILL.md files restructured + 73 references/*.md files created; all command blocks/manifests/tables preserved verbatim except where fixing a flagged defect.
  • Verified: all 24 procedural skills pass the close-eyes test; zero residual ${version}/:external+/Step N:/Trigger keywords/flat-bold-admonition occurrences; every frontmatter parses as valid YAML with a single H1; all (use the X skill) cross-refs and SKILL.md → references/*.md links resolve.

Flagged rather than guessed

  • :external+ctkconfiguration mapped to the Container Toolkit install-guide page (adjust if a dedicated config page exists).
  • Overview image points at the published _images/ URL rather than committing the binary.
  • Bare cross-refs inside the large historical references/release-notes.md changelog left as prose (low value, high volume); user-facing SKILL.md cross-refs all repaired.

🤖 Generated with Claude Code

@chenopis chenopis requested review from a-mccarthy and miyoungc June 1, 2026 11:01
@chenopis chenopis self-assigned this Jun 1, 2026
@chenopis chenopis added documentation Issue/PR focused on fixing/editing/adding documentation bits skills AI agent skills labels Jun 1, 2026
@chenopis chenopis force-pushed the chenopis/gpu-operator-skills-optimized branch from c105d7a to e849b4d Compare June 1, 2026 11:13
chenopis and others added 8 commits June 1, 2026 14:44
Applies verified #401 review findings and the doc-optimize-skill quality
pass to the two highest-blast-radius skills.

gpu-operator-install (renamed from the converter typo
gpu-operator-install-ing-nvidia, finding #6/#1):
- Restored the 4 silently-dropped prerequisites from getting-started.rst
  (ClusterPolicy/OS constraints, container engine, PSA labeling, NFD
  detection) — finding #1.
- Inlined the dropped tf-notebook.yaml literalinclude — finding #2.
- Fixed 5 broken :external+ Sphinx roles to real OpenShift/CTK/Edge URLs
  — finding #7.
- Mapped lost :ref:/:doc: cross-refs to (use the X skill) or published
  doc links — finding #10.
- Fixed nvaie-tanzu_ RST link leak in the platforms table.
- Dropped the Trigger keywords suffix; added triggers:/tags: arrays
  — finding #13.
- Removed Step N: H2 prefixes — finding #14.
- Converted flat-bold admonitions to GitHub alerts — finding #15.
- Replaced 11 ${version} leaks with v26.3.1.

gpu-operator-references:
- Rewrote the mis-routed description (claimed confidential-containers
  only; loads 7 references) — finding #5; dropped Trigger suffix, added
  triggers:/tags:.
- overview.md: fixed missing image asset (#9), 16 pstai_ link leaks (#8),
  lost cross-refs (#10), :external+ocpindex role (#7), identifieis typo
  (#12).
- security.md: removed duplicated phrase (#11).
- life-cycle-policy.md / platform-support.md / release-notes.md:
  converted flat-bold admonitions to GitHub alerts (#15) and repaired
  admonition-boundary over-captures + an ocp_csp_support substitution
  leak surfaced during conversion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
Applies the verified #401 systemic findings to all remaining skills:
- Dropped Trigger keywords suffix; added triggers:/tags: frontmatter
  arrays sourced from each page's .. meta:: block (#13).
- Removed Step N: H2/H3 prefixes (#14).
- Converted flat-bold admonitions to GitHub alerts (#15).
- Replaced ${version} leaks with v26.3.1.
- Fixed 4 broken :external+ Sphinx roles to real OpenShift doc URLs (#7).
- Restored 12 silently dropped .. literalinclude:: code blocks from the
  source manifests across nvidia-driver (4), timeslicing (3),
  gpudirect-rdma (2), google (2), multiinstance (1), amazon (1) (#2).
- Fixed misspelled asset nvd-precomiled-some.yaml -> nvd-precompiled-some.yaml.

Prerequisites/Verification sections and remaining cross-ref repairs land
in the following per-skill commits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
…refs

- Fixed 11 cases where a numbered procedure step was swept into a GitHub
  alert because the flattened RST source lost the blank-line boundary
  (container-device, kata-containers, multiinstance, amazon, nvidia-driver).
- Mapped remaining bare-text :ref: cross-refs in vgpu, multiinstance, and
  government-ready SKILLs to published doc links (#10).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
Adds systemic Prerequisites and/or Verification sections to
container-device, custom-driver, driver-upgrades, nvidia-driver,
nvidia-azure, and service-mesh skills (#systemic). Fixed a bare
getting-started cross-ref in service-mesh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
Adds Prerequisites to outdated-kernels, airgapped, uninstalling,
gpudirect-rdma, multiinstance, and timeslicing skills. Repaired the
uninstalling NOTE admonition over-capture (two-bullet note + code block)
and mapped timeslicing in-page section cross-refs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
- Restored title H1s that the converter replaced with a bare # Prerequisites
  heading, moving prerequisites to a ## section after the real title
  (http-proxy, vgpu, kubevirt, dra, google, upgrading).
- Restored dropped prerequisite list bodies from source RST (vgpu, kubevirt,
  google, upgrading).
- Fixed the google mangled approaches table (finding #4) and stray
  '- name: RUNTIME_CONFIG_SOURCE' YAML fragment leak (finding #3).
- Fixed http-proxy admonition over-capture + proxy_config_openshift ref.
- Added Prerequisites to gov-ready, kata, precompiled, amazon, nvaie.
- Added Verification sections to nvaie, vgpu, amazon, google, upgrading,
  service-mesh, nvidia-driver, container-device.
- Mapped remaining bare cross-refs (nvaie component-matrix/install,
  google/upgrading skill refs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
… placeholder

A prior optimization pass replaced the source ${version} Sphinx
substitution with a frozen patch version (v26.3.1) across the skills,
which freezes a specific release and goes stale. The original #401
finding was a ${version} leak (raw template var in rendered output),
so the fix is a non-leaking, non-frozen value.

Replace every command/URL/image-tag occurrence of v26.3.1 introduced on
this branch with the angle-bracket placeholder <gpu-operator-version>
(matching the project's existing <supported-version> / <driver-branch>
convention), and add a brief inline note on first use per file pointing
to the GPU Operator releases page.

The version-specific factual reference data (release-notes.md changelog
heading, life-cycle-policy.md version table) is left intact — that
content genuinely describes a specific historical release and was not
introduced by the optimization pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
…ormation-hiding dispatch layout

Convert the 7 largest procedural skills (450-583 lines each) into a thin
dispatch-layer SKILL.md (each <200 lines, all under ~55 non-frontmatter
lines) plus phase-specific references/*.md files. Step-by-step command
sequences, manifests, field tables, and verification output move out of
SKILL.md into references/, so the dispatch layer cannot leak later-phase
procedural detail (structural no-skip property; no tooling dependency).

All verified-fix content from the prior optimization pass (prerequisites,
verification sections, fixed cross-refs, <gpu-operator-version> placeholders)
is preserved and relocated, not removed.

Skills restructured (before -> after SKILL.md raw line count):
- gpu-operator-kata-containers   583 -> 65
- gpu-operator-install           572 -> 71
- gpu-operator-multiinstance     560 -> 62
- gpu-operator-gpudirect-rdma    555 -> 67
- gpu-operator-kubevirt          515 -> 68
- gpu-operator-timeslicing-gpus  492 -> 62
- gpu-operator-nvidia-driver     451 -> 60

Each passes the close-your-eyes dispatch-layer test (leaked cmds, bash
blocks, and line count all within thresholds). Internal SKILL.md ->
references/*.md links verified to resolve.

The remaining 17 procedural skills plus the reference skill are deferred
to a later batch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
@chenopis chenopis force-pushed the chenopis/gpu-operator-skills-optimized branch from e849b4d to 1271664 Compare June 1, 2026 21:44
chenopis and others added 5 commits June 1, 2026 21:48
… skills)

Restructure 5 procedural GPU Operator skills to the dispatch-layer
information-hiding pattern (matching the prior 7-skill pass): thin
SKILL.md (frontmatter + intro + Prerequisites + Activation + Phases
table + cross-cutting hard rules + Verification, all <200 lines) with
every step-by-step command sequence, manifest, and field detail moved
into phase-specific references/*.md.

Skills: custom-driver, install-service-mesh, uninstalling-nvidia,
install-http-proxy, install-outdated-kernels.

All five pass the close-your-eyes test (leaked cmds, bash blocks, and
line count within thresholds). Verified-fix content (prerequisites,
verification, <gpu-operator-version> placeholders, fixed cross-refs)
preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
…ready)

Restructure 3 procedural GPU Operator skills to the dispatch-layer
information-hiding pattern: thin SKILL.md (<200 lines) with all
step-by-step detail moved into phase-specific references/*.md.

Skills: nvidia-azure (AKS approaches + preinstalled-driver install),
install-nvidia-enterprise (concepts, vGPU-driver, NLS-token-update,
data-center-driver), install-governmentready-environments (overview,
install, Ubuntu-Pro-token-update).

All three pass the close-your-eyes test. Verified-fix content
(prerequisites, verification, <gpu-operator-version> placeholders,
cross-refs) preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
…EKS, GKE)

Restructure 3 procedural GPU Operator skills to the dispatch-layer
information-hiding pattern: thin SKILL.md (<200 lines) with all
step-by-step detail moved into phase-specific references/*.md.

Skills: container-device (CDI, NRI, verification), nvidia-amazon
(approaches, eksctl-example), nvidia-google (approaches,
google-driver-installer, nvidia-driver-manager).

All three pass the close-your-eyes test. Verified-fix content
preserved (<gpu-operator-version> placeholders, cross-refs to
gpu-operator-install/-references skills, verification sections).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
…pgrading-nvidia)

Restructure 2 procedural GPU Operator skills to the dispatch-layer
information-hiding pattern: thin SKILL.md (<200 lines) with all
step-by-step detail moved into phase-specific references/*.md.

Skills: driver-upgrades (concepts, upgrade-controller, without-controller),
upgrading-nvidia (helm-upgrade, other-updates).

While moving content, repaired pre-existing un-fenced bash blocks in the
driver-upgrades troubleshooting steps (wrapped bare '$ kubectl ...' lines
in console code fences). All other content preserved verbatim, including
the upgrade-controller state-machine graphics reference and
<gpu-operator-version> placeholders.

Both pass the close-your-eyes test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
… airgapped, dra)

Restructure the 4 remaining (largest) procedural GPU Operator skills to
the dispatch-layer information-hiding pattern: thin SKILL.md (<200 lines)
with all step-by-step detail moved into phase-specific references/*.md.

Skills: precompiled-drivers (concepts, availability, enable-disable,
build-custom), install-nvidia-vgpu (concepts, build-driver,
configure-and-install), install-airgapped-environments (concepts,
local-image-registry, local-package-repository, deploy), nvidia-dra
(concepts, install-gpu-operator, install-dra-driver, health-checks).

All four pass the close-your-eyes test. Verified-fix content preserved
(prerequisites, verification, <gpu-operator-version> placeholders,
EULA/license warnings, cross-refs to install/precompiled skills).

This completes the 17-skill remaining set; with the prior 7 done, all 24
procedural gpu-operator skills are now information-hidden.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Chen <andrewch@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Issue/PR focused on fixing/editing/adding documentation bits skills AI agent skills

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant