fix(vmclass): recompute cpu features so vms can schedule on listed available nodes #2501
Open
fl64 wants to merge 1 commit into
…ailable nodes

VirtualMachineClass with cpu.type=Discovery kept the CPU features from the very first reconcile in Status.CpuFeatures.Enabled. When nodes were added or drained later, the cached feature list was never updated, so the class kept advertising features that no longer existed on availableNodes. As a result, VMs got stuck in Pending because the virt-launcher pod required labels that no node provided.

Compute features from availableNodes on every reconcile so the list always matches the current node set filtered by spec.nodeSelector.

Signed-off-by: Pavel Tishkov <pavel.tishkov@flant.com>
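The idea in the commit message can be illustrated with a minimal, self-contained sketch. This is not the controller's actual code: the `Node` type, the `labels` helper, the `commonFeatures` function, and the `cpu-feature.example.io/` label prefix are all hypothetical names chosen for illustration. It assumes that "common" features means the intersection of the feature labels across all currently available nodes, recomputed from scratch on each call rather than read from a cached status field.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// featureLabelPrefix is an assumed prefix for CPU feature labels;
// the prefix used by the real controller may differ.
const featureLabelPrefix = "cpu-feature.example.io/"

// Node is a simplified stand-in for a Kubernetes node: a name plus labels.
type Node struct {
	Name   string
	Labels map[string]string
}

// labels builds a label map advertising the given CPU features (hypothetical helper).
func labels(features ...string) map[string]string {
	m := make(map[string]string, len(features))
	for _, f := range features {
		m[featureLabelPrefix+f] = "true"
	}
	return m
}

// nodeFeatures extracts the set of CPU features a node advertises via labels.
func nodeFeatures(n Node) map[string]struct{} {
	out := make(map[string]struct{})
	for k, v := range n.Labels {
		if strings.HasPrefix(k, featureLabelPrefix) && v == "true" {
			out[strings.TrimPrefix(k, featureLabelPrefix)] = struct{}{}
		}
	}
	return out
}

// commonFeatures returns the sorted intersection of CPU features across nodes.
// It keeps no state between calls, so the result always reflects the node
// list passed in for the current reconcile.
func commonFeatures(nodes []Node) []string {
	if len(nodes) == 0 {
		return nil
	}
	common := nodeFeatures(nodes[0])
	for _, n := range nodes[1:] {
		feats := nodeFeatures(n)
		for f := range common {
			if _, ok := feats[f]; !ok {
				delete(common, f) // deleting during range is safe in Go
			}
		}
	}
	result := make([]string, 0, len(common))
	for f := range common {
		result = append(result, f)
	}
	sort.Strings(result)
	return result
}

func main() {
	haswell := Node{Name: "old-1", Labels: labels("aes", "avx", "hle", "rtm")}
	newer := Node{Name: "new-1", Labels: labels("aes", "avx")}

	// First reconcile: only the old-generation node is available.
	fmt.Println(commonFeatures([]Node{haswell})) // [aes avx hle rtm]

	// Later reconcile: the old node was drained and a newer one joined.
	fmt.Println(commonFeatures([]Node{newer})) // [aes avx]
}
```

In this sketch, a list cached from the first call would still contain `hle` and `rtm` after the old node is drained, which is the stale state the commit message describes. Recomputing on every call drops them as soon as no available node carries those labels.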
Why do we need it, and what problem does it solve?
A `VirtualMachineClass` with `cpu.type: Discovery` could become impossible to schedule VMs onto after the cluster's node composition changed (new worker nodes joined, an old generation was drained, etc.). The user-visible symptom was a VM stuck in `Pending` with the condition:

The class looked healthy (`Status.AvailableNodes` listed matching nodes, the `Discovered` condition was `True`), but the virt-launcher pod's `nodeSelector` required CPU feature labels, such as `hle`/`rtm` from an Intel Haswell generation, that the current worker nodes no longer provided. Those features had been captured during the very first reconcile and were never refreshed.

What is the expected result?
After the fix, the controller recomputes the common CPU features from the current `availableNodes` on every reconcile, so VMs stuck in `Pending` start scheduling as soon as the next reconcile runs.

Checklist