feat: enable GPU on vanilla arm64 VHD for GB300 (aks-gpu + fabric manager) #8745
xuexu6666 wants to merge 5 commits
…ager)
Exploratory: evaluate running GB300 (Standard_ND128isr_GB300_v6) on the vanilla 2404gen2arm64containerd VHD via the aks-gpu container path instead of the dedicated GB image.
1) install-dependencies.sh: pre-pull aks-gpu-cuda for all non-GB Ubuntu images (amd64 + arm64). aks-gpu-cuda is multi-arch (linux/arm64 exists), so CSE installs the driver at node boot on arm64. GB images are excluded (driver baked via apt in the NVIDIA_GB block).
2) gpu_components.go: add GB200/GB300 (standard_nd128isr_ndr_gb200_v6, standard_nd128isr_gb300_v6) to FabricManagerGPUSizes so CSE starts nvidia-fabricmanager for NVLink/NVSwitch.
Still missing for full GB300 NVL: IMEX and DOCA/OFED RDMA (not installed by aks-gpu). For discussion.
Pull request overview
This PR explores enabling GB300 GPUs on the vanilla Ubuntu 24.04 Gen2 ARM64 containerd VHD by relying on the aks-gpu container-based installation path (instead of a dedicated GB image), and ensures Fabric Manager is enabled for Grace-Blackwell NVLink/NVSwitch SKUs.
Changes:
- Update VHD build dependencies to pre-pull the aks-gpu-cuda image for all non-GB Ubuntu images (including arm64) so CSE can install the NVIDIA driver at node boot.
- Extend Fabric Manager SKU detection to include GB200/GB300 sizes so CSE can start nvidia-fabricmanager when required.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| vhdbuilder/packer/install-dependencies.sh | Expands Ubuntu CUDA driver image pre-pull to include arm64 and excludes GB images via NVIDIA_GB feature flag. |
| pkg/agent/datamodel/gpu_components.go | Adds GB200/GB300 VM sizes to FabricManagerGPUSizes so GPUNeedsFabricManager evaluates true for these SKUs. |
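The gpu_components.go change amounts to a set-membership lookup keyed by VM size. The following is a minimal sketch of that idea, not the repository's actual code: `fabricManagerGPUSizes` and `gpuNeedsFabricManager` are hypothetical stand-ins for FabricManagerGPUSizes and GPUNeedsFabricManager, only the two entries added by this PR are listed, and lower-casing the input before the lookup is an assumption based on the lower-case keys shown in the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// fabricManagerGPUSizes is a hypothetical stand-in for FabricManagerGPUSizes.
// It lists only the two entries this PR adds; the real map has more.
var fabricManagerGPUSizes = map[string]bool{
	"standard_nd128isr_ndr_gb200_v6": true,
	"standard_nd128isr_gb300_v6":     true,
}

// gpuNeedsFabricManager reports whether a VM size is in the set.
// Assumption: keys are lower-case, so the input is normalized first.
// A missing key yields the zero value, false.
func gpuNeedsFabricManager(vmSize string) bool {
	return fabricManagerGPUSizes[strings.ToLower(vmSize)]
}

func main() {
	for _, size := range []string{"Standard_ND128isr_GB300_v6", "Standard_D4s_v5"} {
		fmt.Printf("%s needs fabric manager: %t\n", size, gpuNeedsFabricManager(size))
	}
}
```

Under this sketch, a size that is absent from the map falls through to false, so a SKU left out of the set would simply not get nvidia-fabricmanager started.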
// GB200 / GB300 (Grace-Blackwell NVL) need fabric manager for NVLink/NVSwitch.
"standard_nd128isr_ndr_gb200_v6": true,
"standard_nd128isr_gb300_v6": true,
AgentBaker Linux gate detective
Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168663555
RCA: This is the recurring NetworkIsolated shared-cluster cleanup flake. E2E found existing cluster abe2e-azure-networkisolated-v3-d6cc9 already Failed/Deleting and failed before PR-specific scenario validation. Prior evidence for this same cluster shows ResourceGroupDeletionBlocked / InUseNetworkSecurityGroupCannotBeDeleted on abe2e-networkisolated-securityGroup still attached to the shared VNET subnet.
Confidence: High. Corroborated by timeline/status, Run AgentBaker E2E log 538, unrelated PRs hitting the same cluster/signature, and the existing repair item #38506740.
Strongest alternative: the GB300/arm64 GPU VHD enablement caused node readiness latency; less likely because the first hard blocker is the pre-existing shared cluster Deleting state reproduced across unrelated PRs.
Recommended owner/action: AgentBaker E2E test-infra owner: clean/quarantine abe2e-azure-networkisolated-v3-d6cc9, remove the stuck NSG/subnet association, and harden shared-cluster cleanup/retry.
Wiki signature:
AgentBaker Linux gate detective
Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168756915
# For Ubuntu, pre-pull the CUDA driver image.
# Skip GB images: they bake the GPU driver via apt in the NVIDIA_GB block below. All other Ubuntu
# images (amd64 and arm64) cache aks-gpu-cuda so CSE can install the driver at node boot.
if [ $OS = $UBUNTU_OS_NAME ] && ! grep -q "NVIDIA_GB" <<< "$FEATURE_FLAGS"; then # incl. ARM64 (e.g. GB200/300 on vanilla)
  gpu_action="copy"
// GB200 / GB300 (Grace-Blackwell NVL) need fabric manager for NVLink/NVSwitch.
"standard_nd128isr_ndr_gb200_v6": true,
"standard_nd128isr_gb300_v6": true,
// A100 oddballs.
AgentBaker Linux gate detective
Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168796666
What this PR does / why we need it:
Exploratory: evaluate whether GB300 (Standard_ND128isr_GB300_v6) can run on the vanilla 2404gen2arm64containerd VHD via the aks-gpu container path, instead of the dedicated GB image.
Two changes:
- vhdbuilder/packer/install-dependencies.sh: pre-pull (bake) aks-gpu-cuda for all non-GB Ubuntu images (amd64 + arm64) instead of amd64 only. aks-gpu-cuda is multi-arch (linux/arm64 exists), so CSE can install the driver at node boot on arm64. GB images stay excluded (they bake the driver via apt in the NVIDIA_GB block).
- pkg/agent/datamodel/gpu_components.go: add GB200/GB300 (standard_nd128isr_ndr_gb200_v6, standard_nd128isr_gb300_v6) to FabricManagerGPUSizes so GPUNeedsFabricManager is true and CSE starts nvidia-fabricmanager for NVLink/NVSwitch.
Known gap (intentionally not in this PR): full GB300 NVL also needs IMEX and DOCA/OFED RDMA, which aks-gpu does not install (verified against Azure/aks-gpu source). With this PR alone you get driver + container-toolkit + fabric manager (single-node multi-GPU NVLink), but not multi-node NVL / GPUDirect RDMA. Opening for discussion on whether this direction is worth pursuing vs. the dedicated GB image.
Which issue(s) this PR fixes:
N/A (exploratory).
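For readers who want the pre-pull gating from install-dependencies.sh spelled out, here is a small sketch that restates the shell condition in Go purely for illustration. The function name `shouldPrePullCUDAImage`, its parameters, and the sample OS and feature-flag values are all hypothetical; the only parts taken from the diff are the OS equality check and the substring test for NVIDIA_GB in the feature flags.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldPrePullCUDAImage restates the shell condition from the diff:
// the image is Ubuntu and the feature flags do not contain "NVIDIA_GB".
// grep -q with a plain pattern is a substring match, so strings.Contains
// is used here. Architecture is deliberately not part of the check.
func shouldPrePullCUDAImage(osName, ubuntuOSName, featureFlags string) bool {
	return osName == ubuntuOSName && !strings.Contains(featureFlags, "NVIDIA_GB")
}

func main() {
	// The OS names and flag strings below are made-up sample inputs.
	const ubuntu = "UBUNTU"
	fmt.Println(shouldPrePullCUDAImage("UBUNTU", ubuntu, ""))           // vanilla Ubuntu image
	fmt.Println(shouldPrePullCUDAImage("UBUNTU", ubuntu, "NVIDIA_GB"))  // dedicated GB image
	fmt.Println(shouldPrePullCUDAImage("OTHER", ubuntu, ""))            // non-Ubuntu image
}
```

Because the condition carries no architecture test, amd64 and arm64 Ubuntu images take the same branch, which matches the PR's stated intent of caching aks-gpu-cuda on both.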