feat(anc): gate check-hotfix on enable_provisioning_hotfix contract field #8717

Draft
Devinwong wants to merge 1 commit into devinwong/anc-wire-check-hotfix-wrapper from devinwong/anc-hotfix-env-delivery

@Devinwong Devinwong commented Jun 16, 2026

Collaborator

2.1d - gate check-hotfix on the enable_provisioning_hotfix contract field

POC / M1 draft. AgentBaker / Node SIG side only.

This is the final layer of the provisioning-hotfix stack. It makes the AKSNodeConfig
contract field the single source of truth for whether aks-node-controller check-hotfix
does any work, and relaxes the env gate added in 2.1c.

What changed

  • Proto contract field: add bool enable_provisioning_hotfix = 45; to
    aksnodeconfig/v1/config.proto (next free tag after cse_timeout = 44) and regenerate
    the Go bindings.
  • Go gate: check-hotfix reads the field at the very top of checkHotfix() via
    App.provisioningHotfixEnabled() (reads the node-config JSON that is already on disk and
    calls GetEnableProvisioningHotfix()). When the field is not true (false, unset, or the
    config cannot be read/parsed) it returns the new telemetry outcome disabled and exits 0
    WITHOUT any remote hotfix call. Fail-open everywhere.
  • Wrapper relaxation: aks-node-controller-wrapper.sh now calls check-hotfix
    UNCONDITIONALLY (still wrapped defensively so it can never block provisioning). The
    Go binary self-gates on the contract field.
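
The gate described above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: nodeConfig is a hypothetical stand-in for the generated AKSNodeConfig type, the JSON key name is an assumption, and encoding/json stands in for whatever decoder the binary really uses. The point is only that every failure mode (missing file, malformed JSON, unset field) collapses to "off".

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// nodeConfig is a hypothetical stand-in for the generated AKSNodeConfig type.
// Only the gating field is modeled; the JSON key name is an assumption.
type nodeConfig struct {
	EnableProvisioningHotfix bool `json:"enable_provisioning_hotfix"`
}

// parseEnabled reports whether the raw config enables the hotfix check.
// Any parse error is treated as "off" so provisioning is never blocked.
func parseEnabled(data []byte) bool {
	var cfg nodeConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return false
	}
	return cfg.EnableProvisioningHotfix
}

// provisioningHotfixEnabled reads the config from disk and applies the gate.
// An unreadable file is also treated as "off".
func provisioningHotfixEnabled(path string) bool {
	data, err := os.ReadFile(path)
	if err != nil {
		return false
	}
	return parseEnabled(data)
}

func main() {
	fmt.Println("explicit true:", parseEnabled([]byte(`{"enable_provisioning_hotfix": true}`)))
	fmt.Println("explicit false:", parseEnabled([]byte(`{"enable_provisioning_hotfix": false}`)))
	fmt.Println("unset:", parseEnabled([]byte(`{}`)))
	fmt.Println("malformed:", parseEnabled([]byte(`{not json`)))
	fmt.Println("missing file:", provisioningHotfixEnabled("/nonexistent/aks-node-config.json"))
}
```

Only the first case prints true. Splitting the byte-level parse from the file read keeps the gate easy to unit test without touching the filesystem.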

Read channel

The hotfix-pointer READ CHANNEL is moving from a kube-system ConfigMap (read with a bootstrap
token) to the live-patching-service (LPS) IMDS-attested endpoint, validated reachable pre-kubelet
in e2e. That fetch/auth rewrite lives in #8696; this PR is channel-agnostic. The
enable_provisioning_hotfix contract field and the Go self-gate decide WHETHER check-hotfix
runs at all, independent of which channel it then uses, so the proto field is intentionally
channel-neutral and unchanged by the pivot.

Supersedes the env-delivery approach

An earlier revision of this PR delivered the toggle as an env var via a cse_cmd.sh
template var plus a systemd drop-in (Environment="ENABLE_PROVISIONING_HOTFIX=...") on
aks-node-controller.service, mirroring the IMDS-restriction pattern. That approach was
dropped because:

  • check-hotfix already parses the AKSNodeConfig for its own connection details, so a real
    contract field is available to the binary with zero new plumbing -
    no template var, no drop-in, no env var.
  • In the self-provisioning path the wrapper and the drop-in writer are the same service, so an
    env/drop-in written during provisioning would only take effect on the NEXT boot. Reading the
    contract field directly avoids that activation-timing problem - it works on the same boot
    because the config JSON is on disk before the service starts.

This also means absvc sets ONE field (the contract bool), not an env var plus a field.

Relaxes the 2.1c env gate

This PR relaxes the ENABLE_PROVISIONING_HOTFIX env gate introduced in #8715 (2.1c). Gating
now lives in the Go binary, with the enable_provisioning_hotfix contract field as the single
source of truth. The 2.1c env gate is intentionally added-then-relaxed across the stack so each
PR stays reviewable on its own.

Default-off and fail-open

When enable_provisioning_hotfix is false or unset, behavior is byte-identical to before this
stack: check-hotfix makes no remote call and provisioning proceeds unchanged. Any read or
parse error is treated as off. This preserves the 6-month VHD support window in both directions
(older VHD + newer config and newer VHD + older config are both safe).
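
A small sketch of why an additive bool is safe in both directions. The types, the version field, and the JSON key names below are hypothetical, and encoding/json is used as a stand-in decoder: it ignores unknown keys by default and leaves missing bools at false. The real binary's decoder may behave differently (protojson, for example, rejects unknown fields unless DiscardUnknown is set), so treat this as an illustration of the intended semantics rather than proof of them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// oldConfig models a binary built before the field existed (hypothetical).
type oldConfig struct {
	Version string `json:"version"`
}

// newConfig models a binary that knows about the field (hypothetical).
type newConfig struct {
	Version                  string `json:"version"`
	EnableProvisioningHotfix bool   `json:"enable_provisioning_hotfix"`
}

// decodeOld parses a payload the way an older binary would.
func decodeOld(data []byte) (oldConfig, error) {
	var c oldConfig
	err := json.Unmarshal(data, &c)
	return c, err
}

// decodeNew parses a payload the way a newer binary would.
func decodeNew(data []byte) (newConfig, error) {
	var c newConfig
	err := json.Unmarshal(data, &c)
	return c, err
}

func main() {
	newerPayload := []byte(`{"version":"v1","enable_provisioning_hotfix":true}`)
	olderPayload := []byte(`{"version":"v1"}`)

	// Older binary, newer config: the unknown key is skipped.
	_, err := decodeOld(newerPayload)
	fmt.Println("older binary decode error:", err)

	// Newer binary, older config: the field is absent, so the gate stays off.
	c, err := decodeNew(olderPayload)
	fmt.Println("newer binary gate:", c.EnableProvisioningHotfix, "error:", err)
}
```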

Before / after

Stack

main
 \- #8694  2.1a  base->version hotfix map (Go)
     \- #8696  2.1b  check-hotfix LPS endpoint reader (Go)
         \- #8715  2.1c  wire check-hotfix into wrapper (shell)
             \- #8717  2.1d  enable_provisioning_hotfix contract field + Go self-gate   <- this PR

The aks-rp region toggle that sets the field is in a different repo and is the only remaining
out-of-repo piece. With the field settable on a node, the on-node PoC e2e tests (fail-open and
multi-base) become runnable.

Tests

  • go test ./... in aks-node-controller: all check-hotfix tests pass, including new gate
    tests (disabled -> outcome=disabled and the injected fetcher is never called; enabled ->
    fetch path runs). Pre-existing Windows-only failures (CRLF goldens, file locks, os-release
    message text) are unrelated and also fail on the base branch.
  • Wrapper shellspec updated for unconditional check-hotfix: 7 examples, 0 failures.
  • shellcheck clean on the wrapper.
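
The gate tests mentioned above follow a common pattern: inject a fetcher that counts its calls, then assert on both the outcome and the call count. The sketch below shows the shape of that check under assumptions: the fetcher interface, the countingFetcher type, the checkHotfix signature, and every outcome string other than "disabled" are hypothetical names chosen for illustration, not the PR's real API.

```go
package main

import (
	"errors"
	"fmt"
)

// fetcher abstracts the remote hotfix-pointer read (hypothetical interface).
type fetcher interface {
	Fetch() (string, error)
}

// countingFetcher records how many times Fetch was invoked.
type countingFetcher struct {
	calls   int
	pointer string
	err     error
}

func (f *countingFetcher) Fetch() (string, error) {
	f.calls++
	return f.pointer, f.err
}

// checkHotfix returns a telemetry outcome string. When the gate is off it
// returns "disabled" before touching the fetcher at all.
func checkHotfix(enabled bool, f fetcher) string {
	if !enabled {
		return "disabled"
	}
	if _, err := f.Fetch(); err != nil {
		// Fail-open: report the failure but never block provisioning.
		return "fetch_failed"
	}
	return "checked"
}

func main() {
	off := &countingFetcher{pointer: "hotfix-v1"}
	fmt.Println(checkHotfix(false, off), "calls:", off.calls)

	on := &countingFetcher{pointer: "hotfix-v1"}
	fmt.Println(checkHotfix(true, on), "calls:", on.calls)

	broken := &countingFetcher{err: errors.New("endpoint unreachable")}
	fmt.Println(checkHotfix(true, broken), "calls:", broken.calls)
}
```

Asserting on the call count, not just the outcome, is what verifies the "no remote call when off" guarantee.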

@github-actions

github-actions Bot commented Jun 16, 2026

Contributor

The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ❌ failed (1) | ✅ passed | ✅ passed | Jun 23, 2026, 1:00 AM |

@Devinwong Devinwong force-pushed the devinwong/anc-hotfix-env-delivery branch from 2d4b37d to 59b7cef Compare June 16, 2026 01:33
@Devinwong Devinwong changed the title from "feat(anc): deliver ENABLE_PROVISIONING_HOTFIX to node-controller via contract field + systemd drop-in" to "feat(anc): gate check-hotfix on enable_provisioning_hotfix contract field" Jun 16, 2026
@Devinwong Devinwong force-pushed the devinwong/anc-hotfix-env-delivery branch from 59b7cef to d80ae7d Compare June 16, 2026 02:21
@Devinwong

Collaborator Author

Acknowledged - no action needed. This is the automated Buf CI status, and it reports Build, Format, Lint, and Breaking all passing for the additive optional field enable_provisioning_hotfix = 45. That matches the local buf verification (lint STANDARD clean, WIRE_JSON breaking clean against the 2.1c base, format clean), so the proto change is confirmed compatible and no change is warranted.

@Devinwong

Collaborator Author

Read-channel pivot note (no change to this PR's gating contract).

The hotfix-pointer read channel is moving from Option 2 (kube-system ConfigMap + bootstrap token) to Option 4: the live-patching-service (LPS) IMDS-attested endpoint, which e2e validated is reachable pre-kubelet. That fetch/auth rewrite lives in #8696 and is channel-specific.

This PR (2.1d) is channel-agnostic. The enable_provisioning_hotfix contract field (proto tag 45, plain bool) and the Go self-gate (App.provisioningHotfixEnabled() at the top of checkHotfix(), default-off, fail-open, no remote call when off) decide WHETHER check-hotfix runs at all, independent of which channel it then uses. So:

  • The proto field name stays enable_provisioning_hotfix (channel-neutral, correct).
  • Gate semantics are unchanged: unset/false/nil-config all resolve to disabled.
  • I scrubbed wording in the proto comment, the gate comment + log line, the wrapper comment, and the PR body so they describe the gated action as reading the hotfix pointer from the LPS endpoint rather than the ConfigMap.
  • buf lint and breaking are both clean (the comment-only proto change leaves rawDesc untouched); generated surface is config.pb.go only.

@Devinwong Devinwong force-pushed the devinwong/anc-hotfix-env-delivery branch from d552a0a to 21ae228 Compare June 20, 2026 00:36
@Devinwong Devinwong marked this pull request as draft June 20, 2026 01:20
…ield

Replaces the env-delivery approach (systemd drop-in + cse_cmd.sh) with a single
contract field. check-hotfix self-gates on the new AKSNodeConfig field
enable_provisioning_hotfix (proto tag 45, bool); when it is not true the
command no-ops with telemetry outcome=disabled and makes no remote hotfix call.
Default-off, fail-open.

Relaxes the ENABLE_PROVISIONING_HOTFIX env gate introduced in 2.1c so the wrapper
calls check-hotfix unconditionally; gating now lives in the Go binary via the
contract field as the single source of truth.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Devinwong Devinwong force-pushed the devinwong/anc-wire-check-hotfix-wrapper branch from 9b0f1fd to abdcd9f Compare June 23, 2026 01:00
@Devinwong Devinwong force-pushed the devinwong/anc-hotfix-env-delivery branch from 21ae228 to 5a5b74a Compare June 23, 2026 01:00
@github-actions

Contributor

Changes cached containers or packages on windows VHDs

Please get a Windows SIG member to approve.

The following diff file shows any additions or deletions from what will be cached on Windows VHDs, organised by VHD type.

  • Additions are new things cached.
  • Deletions are things no longer cached.
diff --git a/vhd_files/2022-containerd-gen2.txt b/vhd_files/2022-containerd-gen2.txt
index db10c9e..c51a47f 100644
--- a/vhd_files/2022-containerd-gen2.txt
+++ b/vhd_files/2022-containerd-gen2.txt
@@ -122,0 +123 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -124 +124,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
@@ -129,0 +130 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/secrets-store/driver:v1.5.4
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.11-windows-hpc-1
@@ -131 +131,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.13-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.14-windows-hpc-1
@@ -133 +133,2 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.10-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.11-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.8-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.3-windows-hpc-1
@@ -135 +135,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.5-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.6-windows-hpc-1
@@ -137 +136,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.1-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.2-windows-hpc-1
diff --git a/vhd_files/2022-containerd.txt b/vhd_files/2022-containerd.txt
index 94de353..7312c49 100644
--- a/vhd_files/2022-containerd.txt
+++ b/vhd_files/2022-containerd.txt
@@ -122,0 +123 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -124 +124,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
@@ -129,0 +130 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/secrets-store/driver:v1.5.4
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.11-windows-hpc-1
@@ -131 +131,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.13-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.14-windows-hpc-1
@@ -133 +133,2 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.10-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.11-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.8-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.3-windows-hpc-1
@@ -135 +135,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.5-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.6-windows-hpc-1
@@ -137 +136,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.1-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.2-windows-hpc-1
diff --git a/vhd_files/2025-gen2.txt b/vhd_files/2025-gen2.txt
index d0ea692..36e3641 100644
--- a/vhd_files/2025-gen2.txt
+++ b/vhd_files/2025-gen2.txt
@@ -52,0 +53 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -54 +54,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
@@ -59,0 +60 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/secrets-store/driver:v1.5.4
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.11-windows-hpc-1
@@ -61 +61,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.13-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.14-windows-hpc-1
@@ -63 +63,2 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.10-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.11-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.8-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.3-windows-hpc-1
@@ -65 +65,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.5-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.6-windows-hpc-1
@@ -67 +66,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.1-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.2-windows-hpc-1
diff --git a/vhd_files/2025.txt b/vhd_files/2025.txt
index ab44d8b..b8873d5 100644
--- a/vhd_files/2025.txt
+++ b/vhd_files/2025.txt
@@ -52,0 +53 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.34.6-windows-hp
+mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.2-windows-hp
@@ -54 +54,0 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.3-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes-csi/azurefile-csi:v1.35.4-windows-hp
@@ -59,0 +60 @@ mcr.microsoft.com/oss/v2/kubernetes-csi/secrets-store/driver:v1.5.4
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.11-windows-hpc-1
@@ -61 +61,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.13-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.33.14-windows-hpc-1
@@ -63 +63,2 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.10-windows-hp
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.11-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.34.8-windows-hpc-1
+mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.3-windows-hpc-1
@@ -65 +65,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.5-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.35.6-windows-hpc-1
@@ -67 +66,0 @@ mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.1-windows-hpc
-mcr.microsoft.com/oss/v2/kubernetes/azure-cloud-node-manager:v1.36.2-windows-hpc-1
