OCPBUGS-86588: vSphere boot image hot loop detection is non-functional due to stable template names#6094

Open
djoshy wants to merge 1 commit into
openshift:mainfrom
djoshy:fix-vsphere-template-naming

Conversation

@djoshy
Contributor

@djoshy djoshy commented May 27, 2026

- What I did
vSphere template names are derived solely from <infraID>-rhcos-<failureDomain.Name>, which never changes between upgrades. As a result, the hot loop detector always saw identical providerSpec bytes and could not distinguish a real upgrade from a patch loop. The fix: on vSphere, hot loop detection now stores and compares the RHCOS version associated with each update. All other platforms keep the existing path of comparing providerSpec bytes. I have also split the hot loop detection check so that the read and the write happen in separate routines.
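
For illustration only, the platform split described above could be sketched roughly as follows. This is not the code in this PR: `bootImageValue`, its parameters, the release strings, and the template name are hypothetical stand-ins chosen to show why a stable template name defeats a byte comparison while a release string does not.

```go
package main

import (
	"bytes"
	"fmt"
)

// bootImageValue returns the bytes a hot loop detector would compare.
// Hypothetical helper: on vSphere it prefers the RHCOS release string,
// on every other platform it falls back to the raw providerSpec bytes.
func bootImageValue(platform, rhcosRelease string, providerSpecRaw []byte) []byte {
	if platform == "VSphere" && rhcosRelease != "" {
		return []byte(rhcosRelease)
	}
	return providerSpecRaw
}

func main() {
	// The template name is stable across upgrades, so the spec bytes never change.
	spec := []byte(`{"template":"myinfra-rhcos-fd1"}`)

	// Made-up release strings standing in for two different boot images.
	before := bootImageValue("VSphere", "release-a", spec)
	after := bootImageValue("VSphere", "release-b", spec)

	fmt.Println("spec bytes changed:", !bytes.Equal(spec, spec))
	fmt.Println("derived value changed:", !bytes.Equal(before, after))
}
```

With the release string as the comparison value, a genuine upgrade produces a different value even though the providerSpec is byte-for-byte identical.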

- How to verify it
vSphere disruptive e2es should pass and not cause degrades when the installer artifacts are out of sync. https://redhat.atlassian.net/browse/OCPBUGS-86475 should stop happening, but this is not a fix for the underlying issue of the configmap bouncing, so I chose not to link this fix to that bug.

Summary by CodeRabbit

  • Refactor
    • Improved boot-image handling for vSphere by deriving image identity from stream metadata and making hot-loop detection read-only and deterministic.
  • Bug Fixes
    • Reconciliation now detects potential hot-loop scenarios before applying changes and records state after successful updates to avoid repeated churn.
  • Tests
    • Updated hot-loop tests to validate the revised checks and state-recording behavior.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 27, 2026
@openshift-ci-robot
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-86588, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

- What I did
vSphere template names were derived solely from <infraID>-rhcos-<failureDomain.Name>, which never changes between upgrades. This meant the hot loop detector always saw identical provider spec bytes, making it unable to distinguish a real upgrade from a patch loop. Fixed by appending an FNV-1a hash of the OVA SHA256 to the template name, making it content-addressed.

- How to verify it
vSphere disruptive e2es should pass and not cause degrades when the installer artifacts are out of sync (https://redhat.atlassian.net/browse/OCPBUGS-86475 should stop happening, but this is not a fix)

A more thorough test:
Trigger a vSphere boot image update and confirm the new template is created with a name of the form <infraID>-rhcos-<failureDomain.Name>-<hash>. Perform a second update with a different OVA and confirm the hash suffix changes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d3a9d8e1-58f3-4855-a6fd-907497bc22a1

📥 Commits

Reviewing files that changed from the base of the PR and between 35c629c and dbea5e5.

📒 Files selected for processing (2)
  • pkg/controller/bootimage/boot_image_controller_test.go
  • pkg/controller/bootimage/ms_helpers.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/controller/bootimage/boot_image_controller_test.go
  • pkg/controller/bootimage/ms_helpers.go

Walkthrough

Refactors hot-loop detection to derive comparison values from vSphere stream metadata, makes checks non-mutating, runs checks against the proposed MachineSet before patching, records state only after successful patches, and updates tests to the new check signature.

Changes

MAPI Hot-Loop Detection Enhancement

| Layer / File(s) | Summary |
| --- | --- |
| Boot-image value derivation (pkg/controller/bootimage/ms_helpers.go) | Adds github.com/coreos/stream-metadata-go/stream import and getMAPIBootImageValue which prefers vmware OVA release from vSphere stream metadata for the given arch, falling back to ProviderSpec.Value.Raw. |
| Read-only hot-loop check & recorder (pkg/controller/bootimage/ms_helpers.go) | Refactors checkMAPIMachineSetHotLoop to derive comparison bytes via getMAPIBootImageValue and be read-only; adds recordMAPIBootImageState which initializes or increments stored hot-loop counters only after successful patches when derived bytes match stored bytes. |
| Sync flow and test updates (pkg/controller/bootimage/ms_helpers.go, pkg/controller/bootimage/boot_image_controller_test.go) | syncMAPIMachineSet now runs the non-mutating hot-loop check against the proposed newMachineSet before patching and calls recordMAPIBootImageState after a successful patchMachineSet. TestHotLoop updated to the new check signature, simulates post-patch state between iterations, and asserts final hot-loop behavior. |

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Single Node Openshift (Sno) Test Compatibility | ⚠️ Warning | 11 new Ginkgo e2e tests in test/extended-priv/mco_bootimages.go scale MachineSets and assume multiple nodes without SNO compatibility checks. | Add [Skipped:SingleReplicaTopology] labels to Ginkgo It() blocks or wrap in exutil.IsSingleNode() checks with Skip() to protect SNO deployments. |
✅ Passed checks (14 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly identifies the bug (OCPBUGS-86588) and the core issue (vSphere boot image hot loop detection is non-functional due to stable template names), which aligns with the primary change across both modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All 46 test cases in boot_image_controller_test.go use static, deterministic test names with no dynamic values, UUIDs, timestamps, or generated identifiers. |
| Test Structure And Quality | ✅ Passed | Test file uses standard Go testing (t.Run subtests), not Ginkgo framework. Custom check is specific to Ginkgo with It/BeforeEach patterns. |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests were added. The PR only adds/modifies standard Go unit tests (boot_image_controller_test.go) and helper code (ms_helpers.go). The unit tests use "testing" framework, not Ginkgo. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Changes are internal controller logic for machine boot image hot-loop detection. No deployment manifests, operators, or scheduling constraints are introduced. |
| Ote Binary Stdout Contract | ✅ Passed | No process-level stdout writes found. Changes are in controller code with no fmt.Print, log.Print, or print() calls; klog usage occurs only within normal controller methods. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests were added in this PR. Changes are limited to standard Go unit tests in boot_image_controller_test.go and helper functions in ms_helpers.go, neither of which use Ginkgo. |
| No-Weak-Crypto | ✅ Passed | No weak cryptography (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB) found. SHA256 usage is appropriate. bytes.Equal() used only for non-secret boot image identifiers. |
| Container-Privileges | ✅ Passed | PR modifies only Go source code files (2 controller logic files); no container manifests, Dockerfiles, or privilege-related settings are changed. |
| No-Sensitive-Data-In-Logs | ✅ Passed | All logging uses only non-sensitive metadata (machineSet names, timestamps, Kubernetes identifiers). No ProviderSpec, configMap content, or credentials are exposed in logs. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Command failed


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot
Contributor

@djoshy: This pull request references Jira Issue OCPBUGS-86588, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

- What I did
vSphere template names were derived solely from <infraID>-rhcos-<failureDomain.Name>, which never changes between upgrades. This meant the hot loop detector always saw identical provider spec bytes, making it unable to distinguish a real upgrade from a patch loop. Fixed by appending an FNV-1a hash of the OVA SHA256 to the template name, making it content-addressed.

- How to verify it
vSphere disruptive e2es should pass and not cause degrades when the installer artifacts are out of sync. https://redhat.atlassian.net/browse/OCPBUGS-86475 should stop happening, but this is not a fix for the underlying issue of the configmap bouncing, so I chose not to link this fix to that bug.

A more thorough test:
Trigger a vSphere boot image update and confirm the new template is created with a name of the form <infraID>-rhcos-<failureDomain.Name>-<hash>. Perform a second update with a different OVA and confirm the hash suffix changes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 27, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2026
Comment thread pkg/controller/bootimage/vsphere_helpers.go Outdated
@djoshy djoshy force-pushed the fix-vsphere-template-naming branch from 5943240 to 99aba52 Compare May 27, 2026 18:29
@djoshy
Contributor Author

djoshy commented May 27, 2026

/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-vsphere-ovn-csi-vcf9
periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive

@openshift-ci
Contributor

openshift-ci Bot commented May 27, 2026

@djoshy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-vsphere-ovn-csi-vcf9

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/022744a0-59fa-11f1-8a7a-1a6b8917247f-0

@djoshy
Contributor Author

djoshy commented May 27, 2026

/payload-job periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive

@openshift-ci
Contributor

openshift-ci Bot commented May 27, 2026

@djoshy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8f8d1540-59fa-11f1-81c8-01dd4b3aaccc-0


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/controller/bootimage/vsphere_helpers.go (1)

237-243: ⚡ Quick win

Fetch only the VM properties this helper actually uses.

Passing nil to vm.Properties asks vCenter for the entire mo.VirtualMachine, but this path only reads the product version and disk backing. Narrowing the property list keeps this reconciliation check cheaper and avoids pulling unrelated fields on every pass.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/bootimage/vsphere_helpers.go` around lines 237 - 243, The
vm.Properties call currently requests the entire mo.VirtualMachine by passing
nil; change it to request only the needed properties (at minimum
"summary.config.product.version" and the fields read by
getDiskTypeFromExistingVM, e.g. "config.hardware.device") so the function still
reads vmMo.Summary.Config.Product.Version and getDiskTypeFromExistingVM(vmMo)
but without pulling unrelated fields; replace the nil argument with a []string
literal listing these properties and add any additional specific property names
that getDiskTypeFromExistingVM requires.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controller/bootimage/vsphere_helpers.go`:
- Around line 495-501: The current truncation can chop off the hash suffix in
the VM name; instead compute the hash suffix via ovaTemplateHash(ova.Sha256)
(including the leading dash), calculate allowedPrefix = 80 - len(hashSuffix),
truncate the base name (constructed from infraID and failureDomain.Name) to
allowedPrefix if needed, then set name = truncatedBase + hashSuffix so the
content-addressed hash is always preserved; update the len(name) > 80 branch
around the name assignment where infraID, failureDomain.Name and ovaTemplateHash
are used and replace the simple slice truncate with this suffix-preserving
truncation logic.
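
A minimal sketch of the suffix-preserving truncation described in this comment, under stated assumptions: `hashSuffix` is a hypothetical stand-in for ovaTemplateHash (the 32-bit FNV-1a width and the 8-hex-digit format are guesses, not taken from the PR), and names are assumed to be ASCII so that byte slicing is safe.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// maxNameLen is the 80-character limit mentioned in the review comment.
const maxNameLen = 80

// hashSuffix is a hypothetical stand-in for ovaTemplateHash: it returns a
// leading dash plus the 32-bit FNV-1a hash of the OVA SHA256 as 8 hex digits.
func hashSuffix(ovaSHA256 string) string {
	h := fnv.New32a()
	h.Write([]byte(ovaSHA256)) // hash.Hash writes never return an error
	return fmt.Sprintf("-%08x", h.Sum32())
}

// templateName truncates only the base name, so the hash suffix always
// survives. Assumes ASCII input; slicing bytes could split a multi-byte rune.
func templateName(infraID, failureDomain, ovaSHA256 string) string {
	base := infraID + "-rhcos-" + failureDomain
	suffix := hashSuffix(ovaSHA256)
	if allowed := maxNameLen - len(suffix); len(base) > allowed {
		base = base[:allowed]
	}
	return base + suffix
}

func main() {
	name := templateName("myinfra", "a-very-long-failure-domain-name-that-keeps-going-well-past-the-vcenter-name-limit", "abc123")
	fmt.Println(len(name), name)
}
```

Truncating the base before appending the suffix keeps the name within the limit while preserving the part that makes it content-addressed.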

---

Nitpick comments:
In `@pkg/controller/bootimage/vsphere_helpers.go`:
- Around line 237-243: The vm.Properties call currently requests the entire
mo.VirtualMachine by passing nil; change it to request only the needed
properties (at minimum "summary.config.product.version" and the fields read by
getDiskTypeFromExistingVM, e.g. "config.hardware.device") so the function still
reads vmMo.Summary.Config.Product.Version and getDiskTypeFromExistingVM(vmMo)
but without pulling unrelated fields; replace the nil argument with a []string
literal listing these properties and add any additional specific property names
that getDiskTypeFromExistingVM requires.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 07e13907-706a-4655-960f-34d935c21b8d

📥 Commits

Reviewing files that changed from the base of the PR and between 5943240 and 99aba52.

📒 Files selected for processing (1)
  • pkg/controller/bootimage/vsphere_helpers.go

Comment thread pkg/controller/bootimage/vsphere_helpers.go Outdated
@djoshy
Contributor Author

djoshy commented May 27, 2026

/payload-abort

@openshift-ci
Contributor

openshift-ci Bot commented May 27, 2026

@djoshy: aborted 1 active payload job(s) for pull request #6094

@djoshy djoshy force-pushed the fix-vsphere-template-naming branch from 99aba52 to f2279c6 Compare May 28, 2026 14:29

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controller/bootimage/ms_helpers.go`:
- Around line 221-227: The code dereferences infra.Status.PlatformStatus without
checking it for nil, which can panic; update the conditional in the block that
reads infra, streamData and infra.Status.PlatformStatus.Type to also guard that
infra.Status.PlatformStatus != nil before comparing Type (i.e., ensure infra !=
nil && infra.Status.PlatformStatus != nil && streamData != nil &&
infra.Status.PlatformStatus.Type == osconfigv1.VSpherePlatformType), then
proceed to call streamData.GetArchitecture and read
streamArch.Artifacts["vmware"].Release to set value.
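
The guard ordering suggested here can be sketched with local stand-in types. The structs below only mimic the shape of the Infrastructure object and are not the real OpenShift API types; `isVSphere` is a hypothetical helper name.

```go
package main

import "fmt"

// Local stand-ins that mimic the shape of the Infrastructure object.
type platformStatus struct{ Type string }
type infraStatus struct{ PlatformStatus *platformStatus }
type infrastructure struct{ Status infraStatus }

// isVSphere checks every pointer before dereferencing it. Go evaluates &&
// left to right and short-circuits, so Type is only read when
// PlatformStatus is known to be non-nil.
func isVSphere(infra *infrastructure) bool {
	return infra != nil &&
		infra.Status.PlatformStatus != nil &&
		infra.Status.PlatformStatus.Type == "VSphere"
}

func main() {
	fmt.Println(isVSphere(nil))
	fmt.Println(isVSphere(&infrastructure{}))
	fmt.Println(isVSphere(&infrastructure{
		Status: infraStatus{PlatformStatus: &platformStatus{Type: "VSphere"}},
	}))
}
```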
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 536d1541-8258-466a-a44a-98c5940a4aa5

📥 Commits

Reviewing files that changed from the base of the PR and between 99aba52 and f2279c6.

📒 Files selected for processing (2)
  • pkg/controller/bootimage/boot_image_controller_test.go
  • pkg/controller/bootimage/ms_helpers.go

Comment thread pkg/controller/bootimage/ms_helpers.go Outdated
@djoshy djoshy force-pushed the fix-vsphere-template-naming branch 3 times, most recently from fdafdad to 1ee90b9 Compare May 28, 2026 14:46

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controller/bootimage/ms_helpers.go`:
- Around line 209-214: The log "No patching required for MAPI machineset" is
executed even after a successful patch because control falls through when
patchRequired is true; update the logic in the function containing variables
patchRequired and machineSet so that the klog.Infof("No patching required for
MAPI machineset %s", machineSet.Name) line is only executed when patchRequired
is false (e.g., wrap it in an else branch or return immediately after a
successful patch), and ensure the hot-loop check via
ctrl.checkMAPIMachineSetHotLoop(newMachineSet, streamData, infra, arch) still
runs only when a patch was attempted.
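
One way to avoid the fall-through is an early return on the no-patch path. The sketch below is only an illustration of that control flow: `syncMachineSet`, its parameters, and the success message are hypothetical, and the hot-loop check is omitted for brevity.

```go
package main

import "fmt"

// noPatchMsg is the log format quoted in the review comment.
const noPatchMsg = "No patching required for MAPI machineset %s"

// syncMachineSet logs the "no patching" message only when nothing was
// patched, by returning early on that path.
func syncMachineSet(name string, patchRequired bool, patch func() error, logf func(string, ...interface{})) error {
	if !patchRequired {
		logf(noPatchMsg, name)
		return nil
	}
	if err := patch(); err != nil {
		return err
	}
	// Hypothetical success message, not taken from the PR.
	logf("Patched MAPI machineset %s", name)
	return nil
}

func main() {
	logf := func(format string, args ...interface{}) {
		fmt.Printf(format+"\n", args...)
	}
	noop := func() error { return nil }
	_ = syncMachineSet("worker-a", true, noop, logf)
	_ = syncMachineSet("worker-b", false, noop, logf)
}
```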
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b4bd28e0-6ffc-45cd-ba2f-53578060e1a0

📥 Commits

Reviewing files that changed from the base of the PR and between f2279c6 and 1ee90b9.

📒 Files selected for processing (2)
  • pkg/controller/bootimage/boot_image_controller_test.go
  • pkg/controller/bootimage/ms_helpers.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/controller/bootimage/boot_image_controller_test.go

Comment thread pkg/controller/bootimage/ms_helpers.go Outdated
@djoshy djoshy force-pushed the fix-vsphere-template-naming branch from 1ee90b9 to 35c629c Compare May 28, 2026 15:35
@coderabbitai

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@djoshy
Contributor Author

djoshy commented May 28, 2026

@coderabbitai resume

@coderabbitai

coderabbitai Bot commented May 28, 2026

✅ Actions performed

Reviews resumed.

@coderabbitai

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@djoshy
Contributor Author

djoshy commented May 28, 2026

/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-vsphere-ovn-csi-vcf9 periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive

@openshift-ci
Contributor

openshift-ci Bot commented May 28, 2026

@djoshy: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-vsphere-ovn-csi-vcf9
  • periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-vsphere-mco-disruptive

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/938adf70-5aad-11f1-81ba-0b05c5412494-0

@djoshy
Contributor Author

djoshy commented May 28, 2026

/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-vsphere-ovn-csi

@openshift-ci
Contributor

openshift-ci Bot commented May 28, 2026

@djoshy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-vsphere-ovn-csi

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b3cdfa30-5abf-11f1-9f48-2fe9acf07461-0

@djoshy djoshy marked this pull request as ready for review May 28, 2026 18:05
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2026
@openshift-ci openshift-ci Bot requested review from cheesesashimi and umohnani8 May 28, 2026 18:06
@djoshy djoshy force-pushed the fix-vsphere-template-naming branch from 35c629c to dbea5e5 Compare May 28, 2026 19:23
@coderabbitai

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

3 participants