OCPBUGS-78500: Skew enforcement should dynamically handle baremetal clusters#5768
OCPBUGS-78500: Skew enforcement should dynamically handle baremetal clusters#5768djoshy wants to merge 4 commits intoopenshift:mainfrom
Conversation
|
@djoshy: This pull request references Jira Issue OCPBUGS-78500, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughAdds a dynamic informer for the metal3 Provisioning CR, updates boot-image-skew enforcement logic to detect BareMetal legacy provisioning and SNO, suppresses a node metric for BareMetal when enforcement is None, and updates unit and extended tests to cover the new behavior. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
|
Skipping CI for Draft Pull Request. |
|
/test verify |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: djoshy The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/test unit |
1 similar comment
|
/test unit |
|
/test e2e-gcp-op-part1 |
e74d32b to
3096749
Compare
|
/payload-job periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-1of2 |
|
@djoshy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/014f0720-23c2-11f1-9f14-c7e5b1502247-0 |
|
@djoshy: This pull request references Jira Issue OCPBUGS-78500, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
3096749 to
14682cd
Compare
|
/payload-job periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-1of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-2of2 |
|
@djoshy: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5419a430-2456-11f1-92f6-dd80a1bb2cc0-0 |
14682cd to
30925b3
Compare
|
/payload-abort |
|
/payload-job periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-1of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-2of2 |
|
@djoshy: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e6f8d2d0-2456-11f1-9d18-84974aa46967-0 |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/operator/sync_test.go (1)
321-323: The empty-URL scenario never creates aProvisioningCR.
Line 801only inserts the fake object when the URL is non-empty, so the""case exercises theIsNotFoundfallback rather than the documented machine-os-images path where the CR exists and.spec.provisioningOSDownloadURLis empty. Split “CR presence” from “URL value” so both behaviors are covered.Also applies to: 799-808
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/sync_test.go` around lines 321 - 323, The test currently uses the single variable provisioningOSDownloadURL to both decide whether to insert a fake Provisioning CR and to control its .spec.provisioningOSDownloadURL value, so the empty-string case never creates a CR and instead hits the IsNotFound fallback; update the test setup to separate "createCR" from "urlValue" (e.g., keep provisioningOSDownloadURL as the URL value and add a boolean like createProvisioningCR or explicitly call the Provisioning CR insertion unconditionally when you want the CR present) and then add/adjust cases so one scenario creates a Provisioning object with spec.provisioningOSDownloadURL == "" and another scenario omits the CR entirely, modifying the code path that currently inserts the fake Provisioning CR (the test section that populates the fake Provisioning lister) to use the new boolean/explicit control.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/sync.go`:
- Around line 2641-2652: The function isBareMetalOnLegacyProvisioningPath should
treat a cold provisioning informer cache conservatively: check whether the
provisioningLister cache is synced (or detect cache not synced) and return true
(legacy/safe) if it is not yet synced; when optr.provisioningLister.Get returns
an unexpected error, keep the current warning but return true; call
unstructured.NestedString and inspect its ok and err return values instead of
discarding them (do not use (_,_,_)), and return true if parsing fails or the
field is malformed; update logic around provisioningLister.Get,
unstructured.NestedString, and the return value so missing/malformed data or
unsynced cache all default to true.
---
Nitpick comments:
In `@pkg/operator/sync_test.go`:
- Around line 321-323: The test currently uses the single variable
provisioningOSDownloadURL to both decide whether to insert a fake Provisioning
CR and to control its .spec.provisioningOSDownloadURL value, so the empty-string
case never creates a CR and instead hits the IsNotFound fallback; update the
test setup to separate "createCR" from "urlValue" (e.g., keep
provisioningOSDownloadURL as the URL value and add a boolean like
createProvisioningCR or explicitly call the Provisioning CR insertion
unconditionally when you want the CR present) and then add/adjust cases so one
scenario creates a Provisioning object with spec.provisioningOSDownloadURL == ""
and another scenario omits the CR entirely, modifying the code path that
currently inserts the fake Provisioning CR (the test section that populates the
fake Provisioning lister) to use the new boolean/explicit control.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 41155044-6c69-43e1-a716-65416f84b74d
📒 Files selected for processing (6)
pkg/controller/node/node_controller.gopkg/operator/operator.gopkg/operator/sync.gopkg/operator/sync_test.gotest/extended-priv/const.gotest/extended-priv/mco_bootimages_skew.go
95b5b59 to
99af585
Compare
|
/payload-abort |
|
/payload-job periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-1of2 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive-2of2 |
|
@djoshy: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dccb8580-245d-11f1-9da3-2f1f1fadda57-0 |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/operator/sync_test.go (1)
798-819: Test harness correctly mirrors production setup.The dynamic lister setup properly uses the
metal3.io/v1alpha1/provisioningsGVR matching production code.One observation: when
provisioningOSDownloadURLis empty, no Provisioning CR is added to the indexer (testing the "CR not found" scenario). Consider also adding a test case where the CR exists but has an emptyprovisioningOSDownloadURLfield, to ensure both "not found" and "empty URL" scenarios produce the expectedNoneenforcement.💡 Optional: Add test case for CR-exists-with-empty-URL
{ name: "BareMetal platform, CR exists with empty provisioningOSDownloadURL, skew enforcement None expected", infra: buildInfra(withPlatformType(configv1.BareMetalPlatformType)), mcop: buildMachineConfigurationWithNoBootImageConfiguration(), clusterVersion: buildClusterVersion("4.18.0"), annotationExpected: false, provisioningOSDownloadURL: "PRESENT_BUT_EMPTY", // special sentinel expectedManagedBootImagesStatus: opv1.ManagedBootImages{}, expectedSkewEnforcementStatus: apihelpers.GetSkewEnforcementStatusNone(), },This would require updating the harness to handle a sentinel value that adds the CR with an empty URL field.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/sync_test.go` around lines 798 - 819, The test currently only covers "CR missing" vs "CR present with non-empty URL"; add a test case that represents "CR exists but provisioningOSDownloadURL is empty" and update the provisioningIndexer creation logic to recognize a sentinel (e.g., provisioningOSDownloadURL == "PRESENT_BUT_EMPTY") and in that branch Add an Unstructured Provisioning object whose spec.provisioningOSDownloadURL is the empty string; ensure the new case sets provisioningOSDownloadURL to the sentinel and asserts expectedManagedBootImagesStatus and expectedSkewEnforcementStatus (e.g., GetSkewEnforcementStatusNone()) so the Operator (provisioningLister / provisioningIndexer) is exercised for the empty-URL path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/operator/sync_test.go`:
- Around line 798-819: The test currently only covers "CR missing" vs "CR
present with non-empty URL"; add a test case that represents "CR exists but
provisioningOSDownloadURL is empty" and update the provisioningIndexer creation
logic to recognize a sentinel (e.g., provisioningOSDownloadURL ==
"PRESENT_BUT_EMPTY") and in that branch Add an Unstructured Provisioning object
whose spec.provisioningOSDownloadURL is the empty string; ensure the new case
sets provisioningOSDownloadURL to the sentinel and asserts
expectedManagedBootImagesStatus and expectedSkewEnforcementStatus (e.g.,
GetSkewEnforcementStatusNone()) so the Operator (provisioningLister /
provisioningIndexer) is exercised for the empty-URL path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2ba35981-8341-4b35-b398-0faf00671b0b
📒 Files selected for processing (5)
pkg/operator/operator.gopkg/operator/sync.gopkg/operator/sync_test.gotest/extended-priv/const.gotest/extended-priv/mco_bootimages_skew.go
✅ Files skipped from review due to trivial changes (1)
- test/extended-priv/const.go
🚧 Files skipped from review as they are similar to previous changes (3)
- test/extended-priv/mco_bootimages_skew.go
- pkg/operator/operator.go
- pkg/operator/sync.go
99af585 to
9971d2f
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/operator/sync.go (1)
2552-2553: Delay the install-version lookup until it is actually needed.
getOCPVersionFromClusterVersion()is called before branches that immediately returnNone, so it can do unnecessary lister work and emit warnings for statuses that never use an OCP version.Suggested adjustment
- // Grab install time OCP version - ocpVersionAtInstall := optr.getOCPVersionFromClusterVersion() - // Spec override takes priority over all platform defaults. if mcop.Spec.BootImageSkewEnforcement != (opv1.BootImageSkewEnforcementConfig{}) { if mcop.Spec.BootImageSkewEnforcement.Mode == opv1.BootImageSkewEnforcementConfigModeManual { newMachineConfigurationStatus.BootImageSkewEnforcementStatus = opv1.BootImageSkewEnforcementStatus{ Mode: opv1.BootImageSkewEnforcementModeStatusManual, @@ // SNO clusters do not scale; skew enforcement is not applicable. if infra.Status.ControlPlaneTopology == configv1.SingleReplicaTopologyMode { newMachineConfigurationStatus.BootImageSkewEnforcementStatus = apihelpers.GetSkewEnforcementStatusNone() return } + // Grab install time OCP version only when Automatic/Manual is needed. + ocpVersionAtInstall := optr.getOCPVersionFromClusterVersion() + // Platforms with automated boot image updates: mode follows ManagedBootImages configuration.Also applies to: 2575-2582
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/sync.go` around lines 2552 - 2553, The current code eagerly calls getOCPVersionFromClusterVersion() (ocpVersionAtInstall) before early-return branches, causing unnecessary lister work and warnings; change this to compute the install OCP version only when it's actually required (move the getOCPVersionFromClusterVersion() call into the branches that construct statuses needing an install version or wrap it in a small lazy helper). Update the occurrences around the other spots noted (the block referenced at the later occurrence) so both places call getOCPVersionFromClusterVersion() only inside the code paths that build and emit OCP-version-bearing status objects instead of before early returns.pkg/operator/sync_test.go (1)
321-322: Add one case for an unsynced Provisioning cache.
pkg/operator/sync.goat Lines 2643-2645 has a safety-critical branch whenprovisioningListerSyncedis false, but this table never hits it. One case wiringprovisioningListerSynced: func() bool { return false }would protect that behavior.Also applies to: 809-830
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/sync_test.go` around lines 321 - 322, Add a new table-driven test case in pkg/operator/sync_test.go that sets provisioningListerSynced to func() bool { return false } to exercise the unsynced Provisioning cache branch; in the new case keep other fields consistent (e.g., provisioningCRPresent and provisioningOSDownloadURL) and assert the behavior expected when provisioningListerSynced is false (the safety-critical branch in sync.go that handles an unsynced cache). Locate the test table used for sync tests (the one that currently iterates cases including provisioningCRPresent and provisioningOSDownloadURL) and add this case similarly to the other entries so the code path guarded by provisioningListerSynced is covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/operator/sync_test.go`:
- Around line 321-322: Add a new table-driven test case in
pkg/operator/sync_test.go that sets provisioningListerSynced to func() bool {
return false } to exercise the unsynced Provisioning cache branch; in the new
case keep other fields consistent (e.g., provisioningCRPresent and
provisioningOSDownloadURL) and assert the behavior expected when
provisioningListerSynced is false (the safety-critical branch in sync.go that
handles an unsynced cache). Locate the test table used for sync tests (the one
that currently iterates cases including provisioningCRPresent and
provisioningOSDownloadURL) and add this case similarly to the other entries so
the code path guarded by provisioningListerSynced is covered.
In `@pkg/operator/sync.go`:
- Around line 2552-2553: The current code eagerly calls
getOCPVersionFromClusterVersion() (ocpVersionAtInstall) before early-return
branches, causing unnecessary lister work and warnings; change this to compute
the install OCP version only when it's actually required (move the
getOCPVersionFromClusterVersion() call into the branches that construct statuses
needing an install version or wrap it in a small lazy helper). Update the
occurrences around the other spots noted (the block referenced at the later
occurrence) so both places call getOCPVersionFromClusterVersion() only inside
the code paths that build and emit OCP-version-bearing status objects instead of
before early returns.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: afef5a73-ce4e-4ab6-8c68-f3404a3404da
📒 Files selected for processing (5)
pkg/operator/operator.gopkg/operator/sync.gopkg/operator/sync_test.gotest/extended-priv/const.gotest/extended-priv/mco_bootimages_skew.go
✅ Files skipped from review due to trivial changes (1)
- test/extended-priv/const.go
🚧 Files skipped from review as they are similar to previous changes (2)
- test/extended-priv/mco_bootimages_skew.go
- pkg/operator/operator.go
|
@djoshy: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What I did
Added a
baremetalexception to boot image skew enforcement.baremetalclusters installed on 4.10+ use the machine-os-images provisioning path where boot images are CVO-managed and carry no skew risk. Thespec.provisioningOSDownloadURLfield on the Provisioning CR is used to distinguish the two paths. This CR is accessed via a dynamic informer/lister to avoid additional vendors. TheMCCBootImageSkewEnforcementNonealert is suppressed forbaremetalcases, matching existing SNO behavior.This PR also does a refactor of
syncBootImageSkewEnforcementStatusto make it easier to read and to add exclusions like this in the future.How to verify it
machine-os-images path (default on 4.10+):
MachineConfigurationstatus should show that thebootImageSkewEnforcementStatus.modefield is set toNone.MCCBootImageSkewEnforcementNonealert should not be firing.Legacy qcow2 path (simulated):
bootImageSkewEnforcementStatus.modeshould now flip toManual.bootImageSkewEnforcementStatus.modeshould now flip back toNone.