feat: enable GPU on vanilla arm64 VHD for GB300 (aks-gpu + fabric manager) by xuexu6666 · Pull Request #8745 · Azure/AgentBaker

xuexu6666 · 2026-06-18T23:23:45Z

What this PR does / why we need it:

⚠️ Experimental / for discussion. Exploring whether GB300 (Standard_ND128isr_GB300_v6) can run on the vanilla 2404gen2arm64containerd VHD via the aks-gpu container path, instead of the dedicated GB image.

Two changes:

vhdbuilder/packer/install-dependencies.sh — pre-pull (bake) aks-gpu-cuda for all non-GB Ubuntu images (amd64 + arm64) instead of amd64 only. aks-gpu-cuda is multi-arch (linux/arm64 exists), so CSE can install the driver at node boot on arm64. GB images stay excluded (they bake the driver via apt in the NVIDIA_GB block).
pkg/agent/datamodel/gpu_components.go — add GB200/GB300 (standard_nd128isr_ndr_gb200_v6, standard_nd128isr_gb300_v6) to FabricManagerGPUSizes so GPUNeedsFabricManager is true and CSE starts nvidia-fabricmanager for NVLink/NVSwitch.
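
The second change can be sketched in Go as follows. This is a minimal, self-contained illustration and not the actual contents of `gpu_components.go`: only the two GB200/GB300 map entries are taken from this PR. The lowercasing inside `GPUNeedsFabricManager` is an assumption, suggested by the later commit that asserts mixed-case VM sizes. `Standard_D4s_v5` is a sample non-GPU size included only for contrast.

```go
package main

import (
	"fmt"
	"strings"
)

// FabricManagerGPUSizes is an illustrative subset of the real map.
// Only the two entries below come from this PR; the real map holds more SKUs.
var FabricManagerGPUSizes = map[string]bool{
	// GB200 / GB300 (Grace-Blackwell NVL) need fabric manager for NVLink/NVSwitch.
	"standard_nd128isr_ndr_gb200_v6": true,
	"standard_nd128isr_gb300_v6":     true,
}

// GPUNeedsFabricManager reports whether CSE should start nvidia-fabricmanager
// for the given VM size. Assumption: keys are stored lowercase and the lookup
// normalizes its input, so mixed-case size names match.
func GPUNeedsFabricManager(size string) bool {
	return FabricManagerGPUSizes[strings.ToLower(size)]
}

func main() {
	sizes := []string{
		"Standard_ND128isr_GB300_v6",     // mixed case, as written in the PR description
		"standard_nd128isr_ndr_gb200_v6", // already lowercase
		"Standard_D4s_v5",                // sample size that is not in the map
	}
	for _, size := range sizes {
		fmt.Printf("%s -> %t\n", size, GPUNeedsFabricManager(size))
	}
}
```

A missing key in a `map[string]bool` yields `false`, so sizes that are not listed fall through to "no fabric manager" without an explicit branch.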

Known gap (intentionally not in this PR): full GB300 NVL also needs IMEX and DOCA/OFED RDMA, which aks-gpu does not install (verified against Azure/aks-gpu source). With this PR alone you get driver + container-toolkit + fabric manager (single-node multi-GPU NVLink), but not multi-node NVL / GPUDirect RDMA. Opening for discussion on whether this direction is worth pursuing vs. the dedicated GB image.

Which issue(s) this PR fixes:
N/A — exploratory.

…ager)

Exploratory: evaluate running GB300 (Standard_ND128isr_GB300_v6) on the vanilla 2404gen2arm64containerd VHD via the aks-gpu container path instead of the dedicated GB image.

1) install-dependencies.sh: pre-pull aks-gpu-cuda for all non-GB Ubuntu images (amd64 + arm64). aks-gpu-cuda is multi-arch (linux/arm64 exists), so CSE installs the driver at node boot on arm64. GB images excluded (driver baked via apt in the NVIDIA_GB block).
2) gpu_components.go: add GB200/GB300 (standard_nd128isr_ndr_gb200_v6, standard_nd128isr_gb300_v6) to FabricManagerGPUSizes so CSE starts nvidia-fabricmanager for NVLink/NVSwitch.

Still missing for full GB300 NVL: IMEX and DOCA/OFED RDMA (not installed by aks-gpu). For discussion.

Copilot

Pull request overview

This PR explores enabling GB300 GPUs on the vanilla Ubuntu 24.04 Gen2 ARM64 containerd VHD by relying on the aks-gpu container-based installation path (instead of a dedicated GB image), and ensures Fabric Manager is enabled for Grace-Blackwell NVLink/NVSwitch SKUs.

Changes:

Update VHD build dependencies to pre-pull the aks-gpu-cuda image for all non-GB Ubuntu images (including arm64) so CSE can install the NVIDIA driver at node boot.
Extend Fabric Manager SKU detection to include GB200/GB300 sizes so CSE can start nvidia-fabricmanager when required.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `vhdbuilder/packer/install-dependencies.sh` | Expands Ubuntu CUDA driver image pre-pull to include arm64 and excludes GB images via `NVIDIA_GB` feature flag. |
| `pkg/agent/datamodel/gpu_components.go` | Adds GB200/GB300 VM sizes to `FabricManagerGPUSizes` so `GPUNeedsFabricManager` evaluates true for these SKUs. |

+	// GB200 / GB300 (Grace-Blackwell NVL) need fabric manager for NVLink/NVSwitch.
+	"standard_nd128isr_ndr_gb200_v6": true,
+	"standard_nd128isr_gb300_v6":     true,


aks-node-assistant · 2026-06-19T20:07:43Z

AgentBaker Linux gate detective

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168663555
Failed job/stage/task: Run AgentBaker E2E
First failing step/test: $(System.Collections.Hashtable.first)

RCA: This is the recurring NetworkIsolated shared-cluster cleanup flake. E2E found existing cluster abe2e-azure-networkisolated-v3-d6cc9 already Failed/Deleting and failed before PR-specific scenario validation. Prior evidence for this same cluster shows ResourceGroupDeletionBlocked / InUseNetworkSecurityGroupCannotBeDeleted on abe2e-networkisolated-securityGroup still attached to the shared VNET subnet.

Confidence: High. Corroborated by timeline/status, Run AgentBaker E2E log 538, unrelated PRs hitting the same cluster/signature, and the existing repair item #38506740.

Strongest alternative: the GB300/arm64 GPU VHD enablement caused node readiness latency; less likely because the first hard blocker is the pre-existing shared cluster Deleting state reproduced across unrelated PRs.

Recommended owner/action: AgentBaker E2E test-infra owner: clean/quarantine abe2e-azure-networkisolated-v3-d6cc9, remove the stuck NSG/subnet association, and harden shared-cluster cleanup/retry.

Wiki signature: networkisolated-shared-cluster-delete-blocked-inuse-nsg (source of truth)

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


aks-node-assistant · 2026-06-19T21:12:28Z

AgentBaker Linux gate detective

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168756915
Failed job/stage/task: build2204gen2containerd / CIS validation
Summary: Build VHD reported CIS regressions detected (1) in the Ubuntu 22.04 Gen2 containerd lane. The Ubuntu 24.04 SSH service messages were warnings and do not appear to be the blocking failure.
Likely cause/signature: known Ubuntu 22.04 CIS 6.1.3.1 logfile compliance pass-to-fail regression, tracked as linux-vhd-prgate-cis-ubuntu2204-gen2-containerd-6131-logfiles.
Confidence: high.
Strongest alternative: PR #8745's GPU ARM64 changes affected shared provisioning and caused the CIS regression; less likely because the failing lane is generic Ubuntu 22.04 Gen2 containerd and the signature already has many prior occurrences.
Recommended next action/owner: VHD/CIS hardening owner: continue repair item https://msazure.visualstudio.com/CloudNativeCompute/_workitems/edit/38501652 and inspect cis-regressions.txt for the exact path/mode delta.
Wiki: linux-vhd-prgate-cis-ubuntu2204-gen2-containerd-6131-logfiles — https://dev.azure.com/msazure/09706533-03bf-4b43-9a9b-b49c75429646/_wiki/wikis/ed4a85e9-1085-4151-a39b-2753523eba2b?pagePath=%2FAKS%2FSIGs%20and%20Teams%2FAKS%20Components%2FSIG%3A%20Node%20Lifecycle%2FAI%20Agent%20Knowledge%2FAgentBaker%20Gate%20PR%20Pipeline%20Flakiness

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

+# For Ubuntu, pre-pull the CUDA driver image.
+# Skip GB images: they bake the GPU driver via apt in the NVIDIA_GB block below. All other Ubuntu
+# images (amd64 and arm64) cache aks-gpu-cuda so CSE can install the driver at node boot.
+if [ $OS = $UBUNTU_OS_NAME ] && ! grep -q "NVIDIA_GB" <<< "$FEATURE_FLAGS"; then  # incl. ARM64 (e.g. GB200/300 on vanilla)
  gpu_action="copy"


+	// GB200 / GB300 (Grace-Blackwell NVL) need fabric manager for NVLink/NVSwitch.
+	"standard_nd128isr_ndr_gb200_v6": true,
+	"standard_nd128isr_gb300_v6":     true,
 	// A100 oddballs.


aks-node-assistant · 2026-06-19T23:35:21Z

AgentBaker Linux gate detective

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=168796666
Failed job/stage/task: Run AgentBaker E2E
Summary: E2E failed because shared dual-stack cluster abe2e-azure-overlay-dualstack-v6-4a24b was unexpectedly Deleting across ACL, Ubuntu, and AzureLinuxV3 SecondaryNIC dual-stack scenarios.
Likely cause/signature: known shared E2E cluster/fleet outage, shared-cluster-fleet-outage.
Confidence: high.
Strongest alternative: PR #8745's GPU bake-gate changes caused the failure; less likely because the failing tests are shared-cluster SecondaryNIC dual-stack scenarios and the same cluster state failure occurs across unrelated PRs.
Recommended next action/owner: E2E shared-cluster infra owners; continue repair item https://msazure.visualstudio.com/CloudNativeCompute/_workitems/edit/38373323.
Wiki: shared-cluster-fleet-outage — https://dev.azure.com/msazure/09706533-03bf-4b43-9a9b-b49c75429646/_wiki/wikis/ed4a85e9-1085-4151-a39b-2753523eba2b?pagePath=%2FAKS%2FSIGs%20and%20Teams%2FAKS%20Components%2FSIG%3A%20Node%20Lifecycle%2FAI%20Agent%20Knowledge%2FAgentBaker%20Gate%20PR%20Pipeline%20Flakiness

Copilot AI review requested due to automatic review settings June 18, 2026 23:23

Copilot started reviewing on behalf of xuexu6666 June 18, 2026 23:24

Copilot AI reviewed Jun 18, 2026

Comment thread pkg/agent/datamodel/gpu_components.go

Comment on lines +162 to +164

	// GB200 / GB300 (Grace-Blackwell NVL) need fabric manager for NVLink/NVSwitch.
	"standard_nd128isr_ndr_gb200_v6": true,
	"standard_nd128isr_gb300_v6":     true,

docs: comment GB200/300 (not just GB300) on vanilla arm64 bake

6131995

xuexu6666 marked this pull request as ready for review June 19, 2026 20:32

Copilot AI review requested due to automatic review settings June 19, 2026 20:32

xuexu6666 requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, ganeshkumarashok, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, runzhen, sulixu, surajssd, timmy-wright and zachary-bailey as code owners June 19, 2026 20:32

Copilot started reviewing on behalf of xuexu6666 June 19, 2026 20:33

test: add unit coverage for GPUNeedsFabricManager (incl. GB200/GB300)

4bce826

Copilot AI reviewed Jun 19, 2026

test: assert mixed-case GB200/GB300 VM sizes need fabricmanager

9358ad0

Copilot AI review requested due to automatic review settings June 19, 2026 21:35

Copilot started reviewing on behalf of xuexu6666 June 19, 2026 21:35

Copilot AI reviewed Jun 19, 2026

fix: quote string comparison in GPU bake gate (SC2086)

203a339

xuexu6666 wants to merge 5 commits into main from xuex/gb300-vanilla-aksgpu-experiment

2 participants
