feat(gpu): pre-bake NVIDIA CUDA kernel module into the VHD #8661
ganeshkumarashok wants to merge 3 commits into
Pull request overview
This PR moves NVIDIA CUDA DKMS kernel-module compilation out of the node-provisioning (CSE) critical path by optionally pre-building the module during VHD bake, and then selecting a skip-build install path at node boot when a prebake marker is present.
Changes:
- Add an opt-in (`FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE`) VHD-build step to run the `aks-gpu-cuda` container in `build-only` mode and require a generated DKMS marker.
- Update `configGPUDrivers` (Ubuntu path) to choose `install-skip-build` when the marker exists, otherwise default to `install`.
- Add ShellSpec coverage validating the marker → action selection logic.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vhdbuilder/packer/install-dependencies.sh | Adds an opt-in prebake flow that runs the GPU container at image build time and verifies a DKMS marker. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Selects install vs install-skip-build action for the GPU install container based on presence of a marker file. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds ShellSpec tests to validate the action selection behavior. |
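
The marker → action selection summarized in the table can be sketched roughly as follows. This is a minimal illustration, not the code in `cse_config.sh`: the function name `select_gpu_install_action` is hypothetical, and only the default marker path, the `GPU_DKMS_MARKER_FILE` override, and the two action names are taken from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption): prints the action to pass
# to the GPU install container.
select_gpu_install_action() {
    # Default marker path comes from the PR; the override exists for tests.
    local marker="${GPU_DKMS_MARKER_FILE:-/opt/azure/aks-gpu/dkms-marker}"
    local action="install"
    if [ -f "${marker}" ]; then
        # A prebaked module is present, so the boot-time compile can be skipped.
        action="install-skip-build"
    fi
    echo "${action}"
}

# Usage: point the override at a temp file to exercise both branches.
GPU_DKMS_MARKER_FILE="$(mktemp)"
select_gpu_install_action        # marker exists
rm -f "${GPU_DKMS_MARKER_FILE}"
select_gpu_install_action        # marker missing
```

Keeping the default at `install` means a VHD without the marker behaves exactly as before.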
    # Capture the action passed to the install container.
    retrycmd_if_failure() { shift 3; echo "INSTALL_CMD: $*"; return 0; }

    BeforeEach 'OS="$UBUNTU_OS_NAME"; NVIDIA_DRIVER_IMAGE="mcr.microsoft.com/aks/aks-gpu-cuda"; NVIDIA_DRIVER_IMAGE_TAG="580.0.0"; CTR_GPU_INSTALL_CMD="ctr-run"; GPU_DKMS_MARKER_FILE="$(mktemp -u)"'
AgentBaker Linux PR gate: E2E failure (shared test-fixture issue, NOT this PR)
Confirmed systemic; not caused by this PR. The identical sub-7s empty-error shape on the same PR #8661 only changes
Confidence: HIGH that this PR is not the cause.
Recommended: rerun once; do not block merge on this signal.
Owner: NodeSIG-dev / E2E infra to triage the
Strongest alternative (less likely): transient ACR-private-endpoint outage, refuted by the consistent sub-7s timing across 72h on multiple PRs.
Side-channel (not the cause, FYI):
Posted by Clawpilot AgentBaker gate detective.
Folded in the non-GPU teardown (commit a70a1ae) so this stays one PR.
Why: regular x86 CUDA shares its Ubuntu VHD with non-GPU nodes (there is no x86 GPU-dedicated VHD; only GB200 has one). Pre-baking the driver leaves non-GPU nodes carrying an installed, DKMS-registered driver that recompiles
Change:
Open decision in-thread: sync (wired) vs async
Move the ~100s in-CSE NVIDIA DKMS kernel-module compile off the node provisioning critical path by compiling it at VHD build time.

VHD build (install-dependencies.sh), opt-in via FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE: after pre-pulling the aks-gpu-cuda image, run the container in build-only mode (/entrypoint.sh build-only). aks-gpu compiles + DKMS-registers the kernel module and stages userspace libs against the VHD's kernel with NO device access (safe on the GPU-less Packer builder) and writes /opt/azure/aks-gpu/dkms-marker.

Node boot (cse_config.sh): configGPUDrivers passes install-skip-build when the marker is present, so aks-gpu skips the recompile and runs only the device-init steps. aks-gpu re-validates the marker (kernel+version+kind) and falls back to a full build on mismatch (kernel upgrade / shared VHD with a different driver kind), so behaviour is unchanged on non-prebaked VHDs (no marker -> install).

The driver image is intentionally kept in the VHD: boot-time device init still sources the container toolkit, fabric manager, containerd runtime config and udev rules from it. Dropping the image is a separate, deferred size optimization.

Requires the aks-gpu image that supports build-only/install-skip-build (PR #159).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
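
The opt-in VHD-build step described in this commit message might look roughly like the sketch below. Several pieces are assumptions made for illustration: `prebake_cuda_driver` and `run_gpu_build_only` are hypothetical names, the substring match on `FEATURE_FLAGS` is a guess at how the flag is tested, and the container run is replaced by a stub that only simulates the marker being written (the real invocation is not shown here).

```shell
#!/usr/bin/env bash
# Stand-in (assumption) for running the aks-gpu-cuda container with
# `/entrypoint.sh build-only`. It only simulates the marker being written
# so the sketch can run on any machine.
run_gpu_build_only() {
    mkdir -p "$(dirname "${GPU_DKMS_MARKER_FILE}")"
    echo "simulated marker" > "${GPU_DKMS_MARKER_FILE}"
}

# Hypothetical wrapper: gate on the feature flag, then require the marker.
prebake_cuda_driver() {
    case "${FEATURE_FLAGS:-}" in
        *NVIDIA_CUDA_PREBAKE*) ;;                        # opted in, continue
        *) echo "prebake disabled"; return 0 ;;          # default: no-op
    esac
    run_gpu_build_only || return 1
    # A build that did not produce the marker must fail the VHD build.
    if [ ! -f "${GPU_DKMS_MARKER_FILE}" ]; then
        echo "prebake failed: marker missing" >&2
        return 1
    fi
    echo "prebake ok"
}

# Usage with a throwaway marker location.
GPU_DKMS_MARKER_FILE="$(mktemp -d)/aks-gpu/dkms-marker"
FEATURE_FLAGS="NVIDIA_CUDA_PREBAKE"
prebake_cuda_driver
```

Failing hard when the marker is missing keeps a silently broken bake from shipping a VHD that still pays the full compile at boot.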
A CUDA driver pre-baked into a shared Ubuntu VHD is inherited by non-GPU nodes, which never run the aks-gpu installer. There the installed driver is dead weight: it wastes disk and, while DKMS-registered, forces an nvidia.ko rebuild on every kernel patch (re-introducing the compile cost prebake removes, on nodes with no GPU).

cleanUpGPUDrivers now deregisters the nvidia DKMS module and removes the relocated libs / loader config / marker on non-GPU nodes. Gated on the aks-gpu prebake marker, so it is a no-op on today's non-prebaked VHDs (backward compatible). The nvidia module is never loaded on a non-GPU node, so deregistration cannot hit "module in use".

Adds ShellSpec coverage for the no-op, the teardown, and the dkms-status parser.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Incorporates fixes surfaced by multi-SKU AgentBaker e2e validation of the prebake feature, and integrates with main's logs_to_events-wrapped GPU install:

- cse_config.sh: only request install-skip-build when the prebake marker's driver_kind matches THIS node's NVIDIA_GPU_DRIVER_TYPE. A CUDA-prebaked marker on a shared VHD must NOT short-circuit a GRID node's install: the GRID image may not support install-skip-build and would fail to stage its userspace files (observed as CSE exit 84 in e2e). Pass the action through the timed installGPUDriverImage wrapper added on main.
- cse_install_ubuntu.sh (cleanUpPrebakedGPUDriver): drop the slow per-version 'dkms remove --all' (~35s on the non-GPU provisioning critical path) in favor of removing the DKMS source tree + built module; also remove the driver userspace BINARIES (nvidia-smi etc.) so a non-GPU node is genuinely driver-free instead of leaving nvidia-smi on PATH erroring on missing libs.
- install-dependencies.sh: install gcc/make/libc6-dev before the build-only bake -- the standard non-GPU VHD builder ships gcc/make but not libc6-dev, so nvidia-installer cannot compile the module without it. The boot-time fallback recompile already gets these via installDeps, so it stays intact.
- specs: cover the driver_kind guard (match -> skip-build, mismatch -> full install) and the faster binaries+marker teardown.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
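
The `driver_kind` guard from the first bullet can be illustrated with the sketch below. The marker's on-disk format is defined by aks-gpu and is not shown in this PR, so the `key=value` layout with a `driver_kind=` line is purely an assumption, as are the helper names `marker_driver_kind` and `select_action_with_kind_guard` and the example values `cuda` and `grid`.

```shell
#!/usr/bin/env bash
# ASSUMPTION: the marker is a key=value file containing a `driver_kind=` line.
# Prints the baked driver kind, or nothing if the file or key is missing.
marker_driver_kind() {
    local marker="$1"
    sed -n 's/^driver_kind=//p' "${marker}" 2>/dev/null | head -n 1
}

# Hypothetical selector: skip the build only when the marker exists AND was
# baked for this node's driver kind; anything else falls back to `install`.
select_action_with_kind_guard() {
    local marker="${GPU_DKMS_MARKER_FILE:-/opt/azure/aks-gpu/dkms-marker}"
    local nodeKind="${NVIDIA_GPU_DRIVER_TYPE:-}"
    local action="install"
    if [ -f "${marker}" ] && [ -n "${nodeKind}" ] \
        && [ "$(marker_driver_kind "${marker}")" = "${nodeKind}" ]; then
        action="install-skip-build"
    fi
    echo "${action}"
}

# Usage: a CUDA-baked marker seen by a CUDA node, then by a GRID node.
GPU_DKMS_MARKER_FILE="$(mktemp)"
printf 'driver_kind=cuda\n' > "${GPU_DKMS_MARKER_FILE}"
NVIDIA_GPU_DRIVER_TYPE="cuda"; select_action_with_kind_guard   # install-skip-build
NVIDIA_GPU_DRIVER_TYPE="grid"; select_action_with_kind_guard   # install
```

Treating an unreadable or empty kind as a mismatch errs toward the full install, which is the path that is always supported.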
a70a1ae to 2bf9460
    # aks-gpu relocates the userspace libs under GPU_DEST/lib64; on Ubuntu GPU_DEST=/usr/bin.
    rm -rf /usr/bin/lib64 || true
    # nvidia-installer also drops driver userspace BINARIES under GPU_DEST (=/usr/bin on Ubuntu).
    # Remove them too so a non-GPU node looks genuinely driver-free: otherwise e.g. `nvidia-smi`
    # remains on PATH and, with its libs (lib64) gone, errors instead of being "command not found".
    for nvidiaBin in nvidia-smi nvidia-debugdump nvidia-persistenced nvidia-cuda-mps-control \
        nvidia-cuda-mps-server nvidia-modprobe nvidia-bug-report.sh nvidia-powerd \
        nvidia-ngx-updater nvidia-sleep.sh; do
        rm -f "/usr/bin/${nvidiaBin}" || true
    done
    It 'deregisters the nvidia DKMS module and removes baked artifacts (libs, binaries, marker) when present'
      marker="$(mktemp)"
      GPU_DKMS_MARKER_FILE="${marker}"
      rm() { echo "mock rm $*"; }
      ldconfig() { echo "mock ldconfig"; }
      When call cleanUpPrebakedGPUDriver
      The status should be success
      The output should include "Removing pre-baked NVIDIA driver"
      # deregisters via the DKMS source tree + built module removal (no slow dkms remove)
      The output should include "mock rm -rf /var/lib/dkms/nvidia"
      The output should include "mock rm -f /lib/modules"
      # relocated userspace libs
      The output should include "mock rm -rf /usr/bin/lib64"
      # driver userspace binaries so nvidia-smi becomes "command not found" on non-GPU nodes
      The output should include "mock rm -f /usr/bin/nvidia-smi"
      The output should include "mock ldconfig"
      # the slow per-version dkms remove --all must NOT be on the critical path anymore
      The output should not include "dkms remove"
    End
✅ End-to-end validation summary (real AgentBaker e2e, not just unit tests)
This feature was validated by baking real prebaked VHDs through the
🎯 Capstone: CUDA prebake proven on a real Tesla V100
Marker matched → DKMS recompile skipped → driver loaded →
What was validated
About the e2e "failures"
The GPU e2e run reported failures, but none are caused by this PR. Every GPU scenario that failed did so only on
Traceability
Note: the
AgentBaker Linux gate detective
Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168580160
RCA: E2E infra failure: shared clusters were already Failed/Deleting, including azure-network-v4, overlay-network-v4, and bootstrapprofile-cache. This reuses the shared cluster cleanup / route table deletion-blocked signature; PR GPU CUDA prebake changes are unlikely causal.
Confidence: High for the primary signature. Corroborated by timeline/status, focused failed logs, associated changes, and the flakiness wiki before publishing.
Strongest alternative: the GPU CUDA prebake changes broke E2E; less likely because failure occurs while acquiring pre-existing shared clusters before PR-specific validation.
Recommended owner/action: AgentBaker E2E test-infra: clean/recreate stale shared clusters and harden cleanup around route table/resource dependencies.
Wiki signature: $(System.Collections.Hashtable.sig) (source of truth)
    GPU_INSTALL_ACTION="install"
    GPU_DKMS_MARKER="${GPU_DKMS_MARKER_FILE:-/opt/azure/aks-gpu/dkms-marker}"
These two are assigned without `local`, so they leak into the global shell scope (bash treats any in-function assignment without `local` as global). The sibling `installGPUDriverImage` just above already uses `local gpuInstallAction="${1:-install}"`; suggest matching that here:

    local GPU_INSTALL_ACTION="install"
    local GPU_DKMS_MARKER="${GPU_DKMS_MARKER_FILE:-/opt/azure/aks-gpu/dkms-marker}"

The later `GPU_INSTALL_ACTION="install-skip-build"` reassignment is unaffected. Not a functional bug (both are set unconditionally before use and passed explicitly as a positional arg), but CLAUDE.md calls this out: "use local variables rather than constants when their scoping allows for it" and "avoid using variables declared inside another function, even they are visible. It is hard to reason and might introduce subtle bugs."
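
A small standalone demonstration of the scoping point in this comment; the function names `leaky` and `scoped` are made up for the example and do not exist in the PR.

```shell
#!/usr/bin/env bash
leaky() {
    GPU_INSTALL_ACTION="install"        # no `local`: the assignment is global
}

scoped() {
    local gpuInstallAction="install"    # confined to this function call
}

leaky
scoped

# After both calls, only the non-local variable is visible to the caller.
echo "leaky:  ${GPU_INSTALL_ACTION:-<unset>}"
echo "scoped: ${gpuInstallAction:-<unset>}"
```

The first line reports `install` and the second reports `<unset>`, which is why a later function could accidentally pick up a stale `GPU_INSTALL_ACTION`.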
Shouldn't the aks-gpu repo have a corresponding change to support the "install-skip-build" command?
nvm, I saw the changes in Azure/aks-gpu#162
What

Move the ~80–150s in-CSE NVIDIA DKMS kernel-module compile off the node-provisioning critical path by compiling it at VHD build time.

- VHD build (`install-dependencies.sh`), opt-in via `FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE`: after pre-pulling the `aks-gpu-cuda` image, run it in build-only mode. aks-gpu compiles + DKMS-registers the kernel module and stages userspace libs against the VHD's target kernel with no device access (safe on the GPU-less Packer builders), and writes `/opt/azure/aks-gpu/dkms-marker`. The build-only path first installs `gcc make libc6-dev`: the standard non-GPU builder ships gcc/make but not `libc6-dev`, without which `nvidia-installer` cannot compile.
- Node boot (`cse_config.sh`): `configGPUDrivers` passes install-skip-build when the marker is present and the marker's `driver_kind` matches this node's `NVIDIA_GPU_DRIVER_TYPE`, so aks-gpu skips the recompile and runs only device-init. aks-gpu independently re-validates the marker (kernel + version + kind) and falls back to a full build on any remaining mismatch (e.g. kernel drift), so behaviour is unchanged on non-prebaked VHDs (no marker → `install`). Integrated with the timed `installGPUDriverImage` wrapper on `main`.
- `gpu-driver=none` teardown (`cse_install_ubuntu.sh::cleanUpPrebakedGPUDriver`): when a shared Ubuntu VHD that was prebaked is used for a non-GPU node, deregister the DKMS module and remove the staged driver userspace libs and binaries (e.g. `nvidia-smi`) so the node is genuinely driver-free.

Why

The DKMS compile dominates GPU driver install on every GPU node boot. Pre-baking removes it from the critical path. Works transparently on GPU pools (which default to `ConfigGPUDriverIfNeeded=true`).

Key correctness decisions (surfaced by e2e)

- `driver_kind` guard (GRID safety): a CUDA-prebaked marker on a shared VHD must not short-circuit a GRID node's install: the GRID image may not support `install-skip-build` and would fail to stage its own userspace files (observed as CSE exit 84 in e2e). The guard restricts skip-build to nodes whose driver kind matches the baked marker.
- `dkms remove`: leaving `nvidia-smi` on `PATH` with its libs removed makes it error instead of being "command not found" (a non-GPU node should look driver-free). Replacing the per-version `dkms remove --all` (~35s) with source-tree + built-module removal keeps the non-GPU provisioning path fast.

Validation (real AgentBaker e2e, `[TEST All VHDs]` pipeline)

Validated end-to-end by baking prebaked VHDs (a test ACR carried the unmerged aks-gpu#162 build-only image) and running the relevant GPU e2e scenarios on Ubuntu 22.04 + 24.04 Gen2:

- `nvidia-smi` healthy.
- `uname -r`.

Notes / sequencing

- `aks-gpu-cuda` image in `components.json` must carry that support before the flag is enabled.
- `FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE` on the GPU SKU build jobs once the aks-gpu image has shipped, so this PR cannot break existing VHD builds.
- `gcc`/`libc6-dev` via `installDeps`, so a marker mismatch still compiles at boot.

Tests

- `make generate-testdata`: no drift (the CSE scripts are dropped via cloud-init, not embedded in the customdata snapshot).
- `verify_shell` POSIX gate: changed lines introduce no new `SC3010`/`SC3014`.
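
For context on the two ShellCheck codes named in the POSIX gate item: SC3010 flags `[[ ]]` and SC3014 flags `==` inside `[ ]` when a script is checked as POSIX `sh`. The sketch below shows a portable comparison next to the flagged forms; the helper name `kinds_match` is hypothetical and is not taken from the PR.

```shell
#!/bin/sh
# POSIX-portable comparison: single brackets and a single `=`.
kinds_match() {
    [ "$1" = "$2" ]
}

# Bash-only spellings of the same test, left commented out on purpose:
#   [[ "$1" == "$2" ]]   # SC3010: [[ ]] is undefined in POSIX sh
#   [ "$1" == "$2" ]     # SC3014: == in place of = is undefined in POSIX sh

if kinds_match "cuda" "cuda"; then
    echo "match"
fi
if ! kinds_match "cuda" "grid"; then
    echo "mismatch"
fi
```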