
fix(vmclass): recompute cpu features so vms can schedule on listed available nodes#2501

Open
fl64 wants to merge 1 commit into main from fix/vmclass/stale-discovery-features

fl64 (Member) commented Jun 17, 2026

Why do we need it, and what problem does it solve?

A VirtualMachineClass with cpu.type: Discovery could stop accepting new VMs after the cluster's node composition changed (new worker nodes joined, an older generation was drained, and so on). The user-visible symptom was a VM stuck in Pending with the condition:

Could not schedule the virtual machine: Unschedulable: 0/N nodes are available:
N node(s) didn't match Pod's node affinity/selector.

The class looked healthy (Status.AvailableNodes listed matching nodes and the Discovered condition was True), but the virt-launcher pod's nodeSelector required CPU feature labels, such as hle/rtm from an Intel Haswell generation, that the current worker nodes no longer provided. Those features had been captured during the very first reconcile and were never refreshed.
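The mismatch can be reproduced in isolation. The sketch below is not the controller's code: `featureLabelPrefix`, `selectorMatches`, and `countSchedulable` are hypothetical names, and the label key is an assumption made for illustration. It only shows why a selector built from stale features matches zero of the current nodes.

```go
package main

import "fmt"

// featureLabelPrefix is an assumed label prefix used only for illustration;
// the real key used on the nodes may differ.
const featureLabelPrefix = "cpu-feature.example/"

// selectorMatches reports whether a node carries every label the selector requires.
func selectorMatches(nodeLabels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// countSchedulable returns how many nodes satisfy the selector.
func countSchedulable(nodes []map[string]string, selector map[string]string) int {
	n := 0
	for _, labels := range nodes {
		if selectorMatches(labels, selector) {
			n++
		}
	}
	return n
}

func main() {
	// Selector built from features cached at the first reconcile.
	staleSelector := map[string]string{
		featureLabelPrefix + "avx2": "true",
		featureLabelPrefix + "hle":  "true",
		featureLabelPrefix + "rtm":  "true",
	}
	// The current workers no longer expose hle/rtm.
	currentNodes := []map[string]string{
		{featureLabelPrefix + "avx2": "true"},
		{featureLabelPrefix + "avx2": "true"},
	}
	matched := countSchedulable(currentNodes, staleSelector)
	fmt.Printf("%d/%d nodes match\n", matched, len(currentNodes)) // 0/2 nodes match
}
```

Every node is rejected because a single missing label is enough to fail the selector, which is the situation behind the "didn't match Pod's node affinity/selector" message.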

What is the expected result?

After the fix, the controller recomputes the common CPU features from the current availableNodes on every reconcile, so:

  • VM classes correctly advertise only features present on every node they can run on.
  • Adding or removing worker nodes (drain, decommission, scale-out) no longer leaves stale feature requirements behind.
  • VMs that used to be stuck in Pending start scheduling as soon as the next reconcile runs.
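"Features present on every node" amounts to a set intersection over the per-node feature lists. The following is a minimal sketch of that idea, assuming each available node is represented as a plain list of feature names; `commonFeatures` is a hypothetical helper, not the function used in this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// commonFeatures returns the sorted set of features present on every node.
// Each element of nodeFeatures is the feature list of one available node.
func commonFeatures(nodeFeatures [][]string) []string {
	if len(nodeFeatures) == 0 {
		return nil
	}
	counts := make(map[string]int)
	for _, features := range nodeFeatures {
		seen := make(map[string]struct{}, len(features))
		for _, f := range features {
			if _, dup := seen[f]; dup {
				continue // count each feature at most once per node
			}
			seen[f] = struct{}{}
			counts[f]++
		}
	}
	var common []string
	for f, n := range counts {
		if n == len(nodeFeatures) {
			common = append(common, f)
		}
	}
	sort.Strings(common) // deterministic order, so repeated runs compare equal
	return common
}

func main() {
	// Node set seen at the first reconcile.
	before := [][]string{{"avx2", "hle", "rtm"}, {"avx2", "hle", "rtm"}}
	// Node set after the older generation was drained.
	after := [][]string{{"avx2"}, {"avx2", "sse4.2"}}

	fmt.Println(commonFeatures(before)) // [avx2 hle rtm]
	fmt.Println(commonFeatures(after))  // [avx2]
}
```

Recomputing the intersection from the current node set on each pass, instead of reusing the first result, is what drops hle and rtm once no remaining node provides them.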

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.
section: core
type: fix
summary: "VirtualMachineClass with cpu.type=Discovery no longer keeps stale CPU features captured at first reconcile; features are recomputed from the current set of available nodes on every reconcile."

fl64 force-pushed the fix/vmclass/stale-discovery-features branch from 1209532 to fe2b839 on June 17, 2026 09:46
fl64 changed the title from "fix(vmclass): recompute cpu features on every reconcile" to "fix(vmclass): refresh discovered cpu features so vms schedule on the current nodes" on Jun 17, 2026
fl64 changed the title from "fix(vmclass): refresh discovered cpu features so vms schedule on the current nodes" to "fix(vmclass): keep discovered cpu features in sync with current worker nodes" on Jun 17, 2026
fl64 force-pushed the fix/vmclass/stale-discovery-features branch from fe2b839 to 73e6cb4 on June 17, 2026 10:04
fl64 changed the title from "fix(vmclass): keep discovered cpu features in sync with current worker nodes" to "fix(vmclass): keep discovered cpu features in sync with current nodes" on Jun 17, 2026
fl64 force-pushed the fix/vmclass/stale-discovery-features branch from 73e6cb4 to 191359f on June 17, 2026 10:07
…ailable nodes

VirtualMachineClass with cpu.type=Discovery kept CPU features from the
very first reconcile in Status.CpuFeatures.Enabled. When nodes were
added or drained later, the cached feature list was never updated, so
the class kept advertising features that no longer existed on
availableNodes. This left VMs stuck in Pending because the virt-launcher
pod required labels that no node provided.

Compute features from availableNodes on every reconcile so the list always
matches the current node set filtered by spec.nodeSelector.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
fl64 changed the title from "fix(vmclass): keep discovered cpu features in sync with current nodes" to "fix(vmclass): recompute cpu features so vms can schedule on listed available nodes" on Jun 17, 2026
fl64 force-pushed the fix/vmclass/stale-discovery-features branch from 191359f to bb87e01 on June 17, 2026 10:10
@fl64 fl64 added this to the v1.10.0 milestone Jun 17, 2026
