feat: enable GPU on vanilla arm64 VHD for GB300 (aks-gpu + fabric manager) #8745

Open
xuexu6666 wants to merge 5 commits into main from xuex/gb300-vanilla-aksgpu-experiment

@xuexu6666

What this PR does / why we need it:

⚠️ Experimental / for discussion. Exploring whether GB300 (Standard_ND128isr_GB300_v6) can run on the vanilla 2404gen2arm64containerd VHD via the aks-gpu container path, instead of the dedicated GB image.

Two changes:

  1. vhdbuilder/packer/install-dependencies.sh — pre-pull (bake) aks-gpu-cuda for all non-GB Ubuntu images (amd64 + arm64) instead of amd64 only. aks-gpu-cuda is multi-arch (linux/arm64 exists), so CSE can install the driver at node boot on arm64. GB images stay excluded (they bake the driver via apt in the NVIDIA_GB block).
  2. pkg/agent/datamodel/gpu_components.go — add GB200/GB300 (standard_nd128isr_ndr_gb200_v6, standard_nd128isr_gb300_v6) to FabricManagerGPUSizes so GPUNeedsFabricManager is true and CSE starts nvidia-fabricmanager for NVLink/NVSwitch.
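
The bake gate in change 1 can be pictured with the sketch below. It is an illustration only: the function name `should_prepull_aks_gpu_cuda` is hypothetical, the value assigned to `UBUNTU_OS_NAME` is an assumption, and a `case` pattern stands in for the script's `grep -q` substring match so the sketch runs under plain `sh`. The real script inlines the test rather than wrapping it in a function.

```shell
#!/bin/sh
# Sketch of the bake gate from change 1. The function name is hypothetical and
# the UBUNTU_OS_NAME value is an assumption made for this example.
UBUNTU_OS_NAME="UBUNTU"

# Returns 0 when aks-gpu-cuda should be pre-pulled into the image, 1 otherwise.
should_prepull_aks_gpu_cuda() {
  _os="$1"
  _flags="$2"

  # Only Ubuntu images take this path.
  [ "$_os" = "$UBUNTU_OS_NAME" ] || return 1

  # GB images bake the driver via apt, so they skip the container image.
  # The case pattern is a substring match, like grep -q "NVIDIA_GB".
  case "$_flags" in
    *NVIDIA_GB*) return 1 ;;
  esac

  # No architecture check: amd64 and arm64 both pre-pull.
  return 0
}

# Each entry is "<os>:<feature flags>"; the values are made up for the demo.
for entry in "UBUNTU:" "UBUNTU:NVIDIA_GB" "OTHER:"; do
  os="${entry%%:*}"
  flags="${entry#*:}"
  if should_prepull_aks_gpu_cuda "$os" "$flags"; then
    echo "os=$os flags=[$flags] -> pre-pull aks-gpu-cuda"
  else
    echo "os=$os flags=[$flags] -> skip"
  fi
done
```

Quoting both operands of `[ ... = ... ]` keeps the test well-formed even when the OS variable is empty.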

Known gap (intentionally not in this PR): full GB300 NVL also needs IMEX and DOCA/OFED RDMA, which aks-gpu does not install (verified against Azure/aks-gpu source). With this PR alone you get driver + container-toolkit + fabric manager (single-node multi-GPU NVLink), but not multi-node NVL / GPUDirect RDMA. Opening for discussion on whether this direction is worth pursuing vs. the dedicated GB image.

Which issue(s) this PR fixes:
N/A — exploratory.

…ager)

Exploratory: evaluate running GB300 (Standard_ND128isr_GB300_v6) on the vanilla
2404gen2arm64containerd VHD via the aks-gpu container path instead of the dedicated
GB image.

1) install-dependencies.sh: pre-pull aks-gpu-cuda for all non-GB Ubuntu images
   (amd64 + arm64). aks-gpu-cuda is multi-arch (linux/arm64 exists), so CSE installs
   the driver at node boot on arm64. GB images excluded (driver baked via apt in the
   NVIDIA_GB block).
2) gpu_components.go: add GB200/GB300 (standard_nd128isr_ndr_gb200_v6,
   standard_nd128isr_gb300_v6) to FabricManagerGPUSizes so CSE starts
   nvidia-fabricmanager for NVLink/NVSwitch.
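
As a rough picture of change 2, the sketch below mimics the SKU lookup in shell. The actual lookup is a Go map in gpu_components.go; the function name `needs_fabric_manager` is hypothetical, and lowercasing the input before matching is an assumption about how a mixed-case name such as Standard_ND128isr_GB300_v6 reaches the lowercase keys.

```shell
#!/bin/sh
# Shell illustration of the Go lookup in change 2. The function name is
# hypothetical, and lowercasing the input is an assumption for this sketch.
needs_fabric_manager() {
  _size=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$_size" in
    # Only the two entries this PR adds; the real map holds other sizes too.
    standard_nd128isr_ndr_gb200_v6 | standard_nd128isr_gb300_v6) return 0 ;;
    *) return 1 ;;
  esac
}

# Standard_D4s_v5 is included only as a non-GPU size for contrast.
for sku in Standard_ND128isr_GB300_v6 Standard_D4s_v5; do
  if needs_fabric_manager "$sku"; then
    echo "$sku: start nvidia-fabricmanager"
  else
    echo "$sku: no fabric manager"
  fi
done
```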

Still missing for full GB300 NVL: IMEX and DOCA/OFED RDMA (not installed by aks-gpu).
For discussion.
Copilot AI review requested due to automatic review settings June 18, 2026 23:23

Copilot AI left a comment

Pull request overview

This PR explores enabling GB300 GPUs on the vanilla Ubuntu 24.04 Gen2 ARM64 containerd VHD by relying on the aks-gpu container-based installation path (instead of a dedicated GB image), and ensures Fabric Manager is enabled for Grace-Blackwell NVLink/NVSwitch SKUs.

Changes:

  • Update VHD build dependencies to pre-pull the aks-gpu-cuda image for all non-GB Ubuntu images (including arm64) so CSE can install the NVIDIA driver at node boot.
  • Extend Fabric Manager SKU detection to include GB200/GB300 sizes so CSE can start nvidia-fabricmanager when required.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
vhdbuilder/packer/install-dependencies.sh | Expands Ubuntu CUDA driver image pre-pull to include arm64 and excludes GB images via NVIDIA_GB feature flag.
pkg/agent/datamodel/gpu_components.go | Adds GB200/GB300 VM sizes to FabricManagerGPUSizes so GPUNeedsFabricManager evaluates true for these SKUs.

Comment on lines +162 to +164
// GB200 / GB300 (Grace-Blackwell NVL) need fabric manager for NVLink/NVSwitch.
"standard_nd128isr_ndr_gb200_v6": true,
"standard_nd128isr_gb300_v6": true,
@aks-node-assistant

AgentBaker Linux gate detective

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168663555
Failed job/stage/task: Run AgentBaker E2E
First failing step/test: $(System.Collections.Hashtable.first)

RCA: This is the recurring NetworkIsolated shared-cluster cleanup flake. E2E found existing cluster abe2e-azure-networkisolated-v3-d6cc9 already Failed/Deleting and failed before PR-specific scenario validation. Prior evidence for this same cluster shows ResourceGroupDeletionBlocked / InUseNetworkSecurityGroupCannotBeDeleted on abe2e-networkisolated-securityGroup still attached to the shared VNET subnet.

Confidence: High. Corroborated by timeline/status, Run AgentBaker E2E log 538, unrelated PRs hitting the same cluster/signature, and the existing repair item #38506740.

Strongest alternative: the GB300/arm64 GPU VHD enablement caused node readiness latency; less likely because the first hard blocker is the pre-existing shared cluster Deleting state reproduced across unrelated PRs.

Recommended owner/action: AgentBaker E2E test-infra owner: clean/quarantine abe2e-azure-networkisolated-v3-d6cc9, remove the stuck NSG/subnet association, and harden shared-cluster cleanup/retry.

Wiki signature: networkisolated-shared-cluster-delete-blocked-inuse-nsg (source of truth)

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread vhdbuilder/packer/install-dependencies.sh Outdated
Comment on lines +162 to +164
// GB200 / GB300 (Grace-Blackwell NVL) need fabric manager for NVLink/NVSwitch.
"standard_nd128isr_ndr_gb200_v6": true,
"standard_nd128isr_gb300_v6": true,
@aks-node-assistant

aks-node-assistant Bot commented Jun 19, 2026

AgentBaker Linux gate detective

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168756915
Failed job/stage/task: build2204gen2containerd / CIS validation
Summary: Build VHD reported CIS regressions detected (1) in the Ubuntu 22.04 Gen2 containerd lane. The Ubuntu 24.04 SSH service messages were warnings and do not appear to be the blocking failure.
Likely cause/signature: known Ubuntu 22.04 CIS 6.1.3.1 logfile compliance pass-to-fail regression, tracked as linux-vhd-prgate-cis-ubuntu2204-gen2-containerd-6131-logfiles.
Confidence: high.
Strongest alternative: PR #8745's GPU ARM64 changes affected shared provisioning and caused the CIS regression; less likely because the failing lane is generic Ubuntu 22.04 Gen2 containerd and the signature already has many prior occurrences.
Recommended next action/owner: VHD/CIS hardening owner: continue repair item https://msazure.visualstudio.com/CloudNativeCompute/_workitems/edit/38501652 and inspect cis-regressions.txt for the exact path/mode delta.
Wiki: linux-vhd-prgate-cis-ubuntu2204-gen2-containerd-6131-logfiles — https://dev.azure.com/msazure/09706533-03bf-4b43-9a9b-b49c75429646/_wiki/wikis/ed4a85e9-1085-4151-a39b-2753523eba2b?pagePath=%2FAKS%2FSIGs%20and%20Teams%2FAKS%20Components%2FSIG%3A%20Node%20Lifecycle%2FAI%20Agent%20Knowledge%2FAgentBaker%20Gate%20PR%20Pipeline%20Flakiness

Copilot AI review requested due to automatic review settings June 19, 2026 21:35

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment on lines +710 to 714
# For Ubuntu, pre-pull the CUDA driver image.
# Skip GB images: they bake the GPU driver via apt in the NVIDIA_GB block below. All other Ubuntu
# images (amd64 and arm64) cache aks-gpu-cuda so CSE can install the driver at node boot.
if [ $OS = $UBUNTU_OS_NAME ] && ! grep -q "NVIDIA_GB" <<< "$FEATURE_FLAGS"; then # incl. ARM64 (e.g. GB200/300 on vanilla)
gpu_action="copy"
Comment on lines +162 to 165
// GB200 / GB300 (Grace-Blackwell NVL) need fabric manager for NVLink/NVSwitch.
"standard_nd128isr_ndr_gb200_v6": true,
"standard_nd128isr_gb300_v6": true,
// A100 oddballs.
@aks-node-assistant

AgentBaker Linux gate detective

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168796666
Failed job/stage/task: Run AgentBaker E2E
Summary: E2E failed because shared dual-stack cluster abe2e-azure-overlay-dualstack-v6-4a24b was unexpectedly Deleting across ACL, Ubuntu, and AzureLinuxV3 SecondaryNIC dual-stack scenarios.
Likely cause/signature: known shared E2E cluster/fleet outage, shared-cluster-fleet-outage.
Confidence: high.
Strongest alternative: PR #8745's GPU bake-gate changes caused the failure; less likely because the failing tests are shared-cluster SecondaryNIC dual-stack scenarios and the same cluster state failure occurs across unrelated PRs.
Recommended next action/owner: E2E shared-cluster infra owners; continue repair item https://msazure.visualstudio.com/CloudNativeCompute/_workitems/edit/38373323.
Wiki: shared-cluster-fleet-outage — https://dev.azure.com/msazure/09706533-03bf-4b43-9a9b-b49c75429646/_wiki/wikis/ed4a85e9-1085-4151-a39b-2753523eba2b?pagePath=%2FAKS%2FSIGs%20and%20Teams%2FAKS%20Components%2FSIG%3A%20Node%20Lifecycle%2FAI%20Agent%20Knowledge%2FAgentBaker%20Gate%20PR%20Pipeline%20Flakiness
