(WIP) OCPBUGS-86181 by isabella-janssen · Pull Request #6107 · openshift/machine-config-operator

isabella-janssen · 2026-05-29T20:21:50Z

- What I did

- How to verify it

- Description for the changelog

Summary by CodeRabbit

Tests
- Improved robustness of leader-election verification with polling-based waits instead of immediate assertions.
- Enhanced test resilience for single-node clusters with adaptive timeout handling and connection-error recovery.
- Made machine count transition validation more reliable with dynamic retry logic.
- Added support for testing with USB guard extension in machine configurations.

openshift-merge-bot · 2026-05-29T20:21:52Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci · 2026-05-29T20:21:54Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-05-29T20:22:03Z

Walkthrough

This PR hardens MachineConfig test reliability by adding SNO-aware transition validation, making machine-count checks resilient to transient connection failures, enabling test fallback to master MCP when worker is unavailable, and improving MOSC leader-election log waiting robustness.

Changes

SNO and Image Mode Test Reliability

Layer / File(s)	Summary
SNO Detection Infrastructure and Function Signature Updates `test/extended/machineconfignode.go`, `test/extended/image_mode_status_reporting.go`	Added `errors` and `syscall` imports for connection-error classification. Updated `ValidateTransitionThroughConditions` signature to accept `oc *exutil.CLI` parameter, enabling SNO cluster detection within transition validation.
SNO-Aware Transition Validation Logic `test/extended/machineconfignode.go`	Implemented SNO detection and adjusted timing windows for SNO clusters. Modified transition checks to skip Cordoned/Drained validation for SNO, handle ECONNREFUSED errors as warnings during AppliedFiles and UpdateExecuted phases, and restrict UpdateComplete/Uncordoned checks to non-SNO non-rebootless updates.
Connection-Error Resilience in Machine Count Validation `test/extended/image_mode_status_reporting.go`	Reworked `validateMCPMachineCountTransitions` with dynamic retry count for layered and SNO updates. Changed `mcnAndNodeAnnotationMachineCountsMatch` to return `(countsMatch, isConnErr)` and classify connection-refused errors via `errors.Is(..., syscall.ECONNREFUSED)` to enable retries on transient failures.
Default MCP Fallback Test Support `test/extended/image_mode_status_reporting.go`	Added `runImageModeMCNTestDefaultMCP` helper to run tests against master MCP when worker MCP has no nodes. Updated test cases to branch between custom infra and default master MCP execution. Updated `validateMCNTransitions` to pass `oc` context and rebootless flag to transition validation.
Test Fixtures and MOSC Leader-Election Robustness `test/extended-priv/mco_ocb.go`, `test/extended-priv/testdata/files/machineconfigs/master-extension-mc.yaml`	Added master-extension-mc.yaml fixture defining a master-role MachineConfig with usbguard extension. Wrapped `ValidateSuccessfulMOSC` leader-election log assertion in `Eventually` block polling for up to 5 minutes instead of immediate assertion.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

dkhater-redhat

🚥 Pre-merge checks | ✅ 4 | ❌ 11

❌ Failed checks (1 warning, 10 inconclusive)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The pull request title is a single reference to a Jira ticket ID without describing any actual changes or objectives. It provides no meaningful information about what the code modifications accomplish.	Replace the title with a descriptive summary of the main changes, such as 'Add SNO-aware resilience to test suite for machine config transitions' or similar that captures the primary purpose of the modifications.
Stable And Deterministic Test Names	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Test Structure And Quality	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Microshift Test Compatibility	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Single Node Openshift (Sno) Test Compatibility	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Topology-Aware Scheduling Compatibility	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ote Binary Stdout Contract	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ipv6 And Disconnected Network Test Compatibility	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Weak-Crypto	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Container-Privileges	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Sensitive-Data-In-Logs	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

Tools execution failed with the following error:

Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-05-29T20:22:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: isabella-janssen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [isabella-janssen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

isabella-janssen · 2026-05-29T20:22:18Z

/payload-aggregate periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-single-node-disruptive-techpreview 5

openshift-ci · 2026-05-29T20:22:22Z

@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-single-node-disruptive-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/178b0a90-5b9c-11f1-8b57-67f8e83fcc97-0

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended-priv/mco_ocb.go`:
- Around line 314-317: The callback passed to o.Eventually currently discards
the error from mOSBuilder.Logs(); change it to capture the error (logs, err :=
mOSBuilder.Logs()) and handle it instead of using the blank identifier: if err
!= nil, return a descriptive string containing the error (or otherwise surface
the error to the test/failer) so the Eventually assertion sees failures to fetch
logs; update the callback around mOSBuilder.Logs() used inside o.Eventually(...)
to propagate or report the error rather than ignoring it.

In `@test/extended/image_mode_status_reporting.go`:
- Around line 394-409: In validateMCPMachineCountTransitions, stop ignoring the
error from extpriv.IsSNOSafe(oc): capture the returned error, log or return it
(or set a safe default for loopCount) and decide fallback behavior when
IsSNOSafe fails; also ensure the SNO retry-budget bump (loopCount = 10) is
applied where intended—either move the IsSNOSafe check out of the layered-only
block or duplicate its handling for the non-layered path so SNO clusters get the
higher retry count in both cases if that’s the desired behavior.

In `@test/extended/machineconfignode.go`:
- Around line 302-315: The code currently skips asserting UpdateExecuted=True
unconditionally; restore a failing assertion for non-SNO/non-connection-error
paths: after calling waitForMCNConditionStatus (function
waitForMCNConditionStatus, vars conditionMet and err, checking isSNO and
errors.Is(err, syscall.ECONNREFUSED)), keep the special-case that logs and
tolerates the ECONNREFUSED for SNO, but in the else branch ensure you both
assert no error (o.Expect(err).NotTo(HaveOccurred())) and assert conditionMet is
true (o.Expect(conditionMet).To(o.BeTrue(), fmt.Sprintf("Error, could not detect
%s on node %s: %v", mcfgv1.MachineConfigNodeUpdateExecuted, updatingNodeName,
err))). This guarantees non-SNO and non-conn-error runs fail if UpdateExecuted
was not observed.
- Around line 208-216: The call to extpriv.IsSNOSafe(oc) currently discards its
error (isSNO, _ := extpriv.IsSNOSafe(oc)); change it to capture the error and
handle it explicitly (e.g., isSNO, err := extpriv.IsSNOSafe(oc); if err != nil {
/* fail the test or log and return */ }) so that SNO detection failures do not
silently fall back to non-SNO timing; update the surrounding logic that uses
isSNO, and ensure any failure uses the test framework's failure/logging
mechanism (e.g., t.Fatalf, framework.Failf or klog/Logf as appropriate) instead
of ignoring the error while still setting updatingWaitTime and
updatingWaitInterval.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 08da7797-b463-4018-8e76-ae6f432da12b

📥 Commits

Reviewing files that changed from the base of the PR and between d72b715 and 5614069.

📒 Files selected for processing (4)

test/extended-priv/mco_ocb.go
test/extended-priv/testdata/files/machineconfigs/master-extension-mc.yaml
test/extended/image_mode_status_reporting.go
test/extended/machineconfignode.go

coderabbitai · 2026-05-29T20:31:14Z

+	o.Eventually(func() string {
+		logs, _ := mOSBuilder.Logs()
+		return logs
+	}, "5m", "10s").Should(o.And(


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle mOSBuilder.Logs() errors instead of discarding them.

The callback currently ignores the error return (logs, _ := ...), which can hide transient/fatal log-fetch failures and make this assertion flaky.

Proposed fix

- o.Eventually(func() string { - logs, _ := mOSBuilder.Logs() - return logs + o.Eventually(func() (string, error) { + return mOSBuilder.Logs() }, "5m", "10s").Should(o.And( o.MatchRegexp("(?i)"+regexp.QuoteMeta("attempting to acquire leader lease")), o.MatchRegexp("(?i)"+regexp.QuoteMeta("successfully acquired lease"))), "The machine os builder pod is not using the leader election without failures")

As per coding guidelines: **/*.go: “Never ignore error returns”.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

o.Eventually(func() string {

logs, _ := mOSBuilder.Logs()

return logs

}, "5m", "10s").Should(o.And(

o.Eventually(func() (string, error) {

return mOSBuilder.Logs()

}, "5m", "10s").Should(o.And(

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended-priv/mco_ocb.go` around lines 314 - 317, The callback passed to o.Eventually currently discards the error from mOSBuilder.Logs(); change it to capture the error (logs, err := mOSBuilder.Logs()) and handle it instead of using the blank identifier: if err != nil, return a descriptive string containing the error (or otherwise surface the error to the test/failer) so the Eventually assertion sees failures to fetch logs; update the callback around mOSBuilder.Logs() used inside o.Eventually(...) to propagate or report the error rather than ignoring it.

coderabbitai · 2026-05-29T20:31:14Z

 func validateMCPMachineCountTransitions(machineConfigClient *machineconfigclient.Clientset, oc *exutil.CLI, mcpName string, startTime metav1.Time, mosc *extpriv.MachineOSConfig, layered bool) {
 	timeout := 15 * time.Minute
 	interval := 10 * time.Second
+	loopCount := 3
 	// For on-cluster image mode updates, set the timeout to be longer
 	if layered {
 		logger.Infof("Layered update, setting longer update timeout.")
 		timeout = 45 * time.Minute
 		interval = 30 * time.Second
+		// SNO clusters require a longer time to reconcile due to the MCC restart in
+		// reboot-required updates, so allow for more retries.
+		if isSNO, _ := extpriv.IsSNOSafe(oc); isSNO {
+			logger.Infof("Cluster is SNO, setting higher retry count.")
+			loopCount = 10
+		}
 	}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '332,385p' test/extended/image_mode_status_reporting.go printf '\n---\n' sed -n '394,409p' test/extended/image_mode_status_reporting.go

Repository: openshift/machine-config-operator

Length of output: 2017

🏁 Script executed:

#!/bin/bash set -euo pipefail # Confirm the caller(s) of validateMCPMachineCountTransitions rg -n "validateMCPMachineCountTransitions" test/extended/image_mode_status_reporting.go # Show the runMachineCountTest implementation area that calls it rg -n "func runMachineCountTest" -n test/extended/image_mode_status_reporting.go sed -n '300,520p' test/extended/image_mode_status_reporting.go # Re-print the exact block referenced in the review sed -n '380,430p' test/extended/image_mode_status_reporting.go

Repository: openshift/machine-config-operator

Length of output: 1988

Handle IsSNOSafe error and verify SNO retry-budget scope

validateMCPMachineCountTransitions discards the error from extpriv.IsSNOSafe(oc) (isSNO, _ := ...), violating the Go security rule “Never ignore error returns”; handle the error explicitly and decide the fallback behavior for loopCount.

The SNO loopCount = 10 bump is only reachable inside if layered, so SNO clusters will only get the higher retry budget for layered updates—confirm this is intended and extend the logic to the non-layered path if not.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/image_mode_status_reporting.go` around lines 394 - 409, In validateMCPMachineCountTransitions, stop ignoring the error from extpriv.IsSNOSafe(oc): capture the returned error, log or return it (or set a safe default for loopCount) and decide fallback behavior when IsSNOSafe fails; also ensure the SNO retry-budget bump (loopCount = 10) is applied where intended—either move the IsSNOSafe check out of the layered-only block or duplicate its handling for the non-layered path so SNO clusters get the higher retry count in both cases if that’s the desired behavior.

coderabbitai · 2026-05-29T20:31:14Z

+	isSNO, _ := extpriv.IsSNOSafe(oc)
 	if isImageMode {
 		updatingWaitTime = 25 * time.Minute
 		updatingWaitInterval = 5 * time.Second
+	} else if isSNO {
+		// SNO transition times can also be a bit longer due to cluster connection instability, so
+		// set the times for longer in such clusters. (Longer image updates timing takes precedence.)
+		updatingWaitTime = 5 * time.Minute
+		updatingWaitInterval = 5 * time.Second


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n --type=go 'IsSNOSafe\(oc\)' test/extended/machineconfignode.go test/extended/image_mode_status_reporting.go

Repository: openshift/machine-config-operator

Length of output: 2017

Don’t discard extpriv.IsSNOSafe(oc) errors in test/extended/machineconfignode.go (lines ~208-216)

isSNO, _ := extpriv.IsSNOSafe(oc) swallows the error, so SNO detection failures can silently switch to the non-SNO timing/assertion path. Handle the returned error explicitly (don’t use _).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/machineconfignode.go` around lines 208 - 216, The call to extpriv.IsSNOSafe(oc) currently discards its error (isSNO, _ := extpriv.IsSNOSafe(oc)); change it to capture the error and handle it explicitly (e.g., isSNO, err := extpriv.IsSNOSafe(oc); if err != nil { /* fail the test or log and return */ }) so that SNO detection failures do not silently fall back to non-SNO timing; update the surrounding logic that uses isSNO, and ensure any failure uses the test framework's failure/logging mechanism (e.g., t.Fatalf, framework.Failf or klog/Logf as appropriate) instead of ignoring the error while still setting updatingWaitTime and updatingWaitInterval.

coderabbitai · 2026-05-29T20:31:14Z

 	logger.Infof("Waiting for UpdateExecuted=True")
 	conditionMet, err = waitForMCNConditionStatus(machineConfigClient, updatingNodeName, mcfgv1.MachineConfigNodeUpdateExecuted, metav1.ConditionTrue, 20*time.Second, 1*time.Second)
-	o.Expect(err).NotTo(o.HaveOccurred(), fmt.Sprintf("Error occurred while waiting for UpdateExecuted=True: %v", err))
-	o.Expect(conditionMet).To(o.BeTrue(), "Error, could not detect UpdateExecuted=True.")
+	// A node update in SNO can quickly transition through the "AppliedFiles" state and cause
+	// us to catch the update during the node reboot. During reboot, we will get connection
+	// errors and, thus, should not error out on such errors.
+	if isSNO && err != nil && errors.Is(err, syscall.ECONNREFUSED) {
+		logger.Infof("Warning, got connection error detecting UpdateExecuted=True.")
+	} else {
+		o.Expect(err).NotTo(o.HaveOccurred(), fmt.Sprintf("Error occurred while waiting for UpdateExecuted=True: %v", err))
+	}
+	if !conditionMet {
+		logger.Infof("Warning, could not detect UpdateExecuted=True.")
+	}
+	// o.Expect(conditionMet).To(o.BeTrue(), "Error, could not detect UpdateExecuted=True.")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

UpdateExecuted=True became best-effort for every environment.

The final Expect(conditionMet).To(BeTrue()) is removed unconditionally, so non-SNO and non-connection-error paths can now miss this transition without failing. That weakens the coverage outside the new SNO exception.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/machineconfignode.go` around lines 302 - 315, The code currently skips asserting UpdateExecuted=True unconditionally; restore a failing assertion for non-SNO/non-connection-error paths: after calling waitForMCNConditionStatus (function waitForMCNConditionStatus, vars conditionMet and err, checking isSNO and errors.Is(err, syscall.ECONNREFUSED)), keep the special-case that logs and tolerates the ECONNREFUSED for SNO, but in the else branch ensure you both assert no error (o.Expect(err).NotTo(HaveOccurred())) and assert conditionMet is true (o.Expect(conditionMet).To(o.BeTrue(), fmt.Sprintf("Error, could not detect %s on node %s: %v", mcfgv1.MachineConfigNodeUpdateExecuted, updatingNodeName, err))). This guarantees non-SNO and non-conn-error runs fail if UpdateExecuted was not observed.

isabella-janssen added 6 commits May 27, 2026 09:14

tests: make MCP machine count transition tests resilient on SNO

cbfb92a

run MCN prop check on SNO

1a28f73

.

c28d5e5

no reboot update test

c689531

timeout for ocb

01f9c36

fixes for condition transition tests

5614069

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 29, 2026

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 29, 2026

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(WIP) OCPBUGS-86181#6107

(WIP) OCPBUGS-86181#6107
isabella-janssen wants to merge 6 commits into
openshift:mainfrom
isabella-janssen:ocpbugs-86181

isabella-janssen commented May 29, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

openshift-merge-bot Bot commented May 29, 2026

Uh oh!

openshift-ci Bot commented May 29, 2026

Uh oh!

coderabbitai Bot commented May 29, 2026 •

edited

Loading

❌ Failed checks (1 warning, 10 inconclusive)

Uh oh!

openshift-ci Bot commented May 29, 2026

Uh oh!

isabella-janssen commented May 29, 2026

Uh oh!

openshift-ci Bot commented May 29, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 29, 2026

Uh oh!

coderabbitai Bot May 29, 2026

Uh oh!

coderabbitai Bot May 29, 2026

Uh oh!

coderabbitai Bot May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

isabella-janssen commented May 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

openshift-merge-bot Bot commented May 29, 2026

Uh oh!

openshift-ci Bot commented May 29, 2026

Uh oh!

coderabbitai Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning, 10 inconclusive)

Uh oh!

openshift-ci Bot commented May 29, 2026

Uh oh!

isabella-janssen commented May 29, 2026

Uh oh!

openshift-ci Bot commented May 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

isabella-janssen commented May 29, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 29, 2026 •

edited

Loading