(WIP) OCPBUGS-86181#6107
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
WalkthroughThis PR hardens MachineConfig test reliability by adding SNO-aware transition validation, making machine-count checks resilient to transient connection failures, enabling test fallback to master MCP when worker is unavailable, and improving MOSC leader-election log waiting robustness. ChangesSNO and Image Mode Test Reliability
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 11❌ Failed checks (1 warning, 10 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Tools execution failed with the following error: Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error) Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: isabella-janssen The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/payload-aggregate periodic-ci-openshift-machine-config-operator-release-5.0-periodics-e2e-aws-mco-single-node-disruptive-techpreview 5 |
|
@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/178b0a90-5b9c-11f1-8b57-67f8e83fcc97-0 |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended-priv/mco_ocb.go`:
- Around line 314-317: The callback passed to o.Eventually currently discards
the error from mOSBuilder.Logs(); change it to capture the error (logs, err :=
mOSBuilder.Logs()) and handle it instead of using the blank identifier: if err
!= nil, return a descriptive string containing the error (or otherwise surface
the error to the test/failer) so the Eventually assertion sees failures to fetch
logs; update the callback around mOSBuilder.Logs() used inside o.Eventually(...)
to propagate or report the error rather than ignoring it.
In `@test/extended/image_mode_status_reporting.go`:
- Around line 394-409: In validateMCPMachineCountTransitions, stop ignoring the
error from extpriv.IsSNOSafe(oc): capture the returned error, log or return it
(or set a safe default for loopCount) and decide fallback behavior when
IsSNOSafe fails; also ensure the SNO retry-budget bump (loopCount = 10) is
applied where intended—either move the IsSNOSafe check out of the layered-only
block or duplicate its handling for the non-layered path so SNO clusters get the
higher retry count in both cases if that’s the desired behavior.
In `@test/extended/machineconfignode.go`:
- Around line 302-315: The code currently skips asserting UpdateExecuted=True
unconditionally; restore a failing assertion for non-SNO/non-connection-error
paths: after calling waitForMCNConditionStatus (function
waitForMCNConditionStatus, vars conditionMet and err, checking isSNO and
errors.Is(err, syscall.ECONNREFUSED)), keep the special-case that logs and
tolerates the ECONNREFUSED for SNO, but in the else branch ensure you both
assert no error (o.Expect(err).NotTo(HaveOccurred())) and assert conditionMet is
true (o.Expect(conditionMet).To(o.BeTrue(), fmt.Sprintf("Error, could not detect
%s on node %s: %v", mcfgv1.MachineConfigNodeUpdateExecuted, updatingNodeName,
err))). This guarantees non-SNO and non-conn-error runs fail if UpdateExecuted
was not observed.
- Around line 208-216: The call to extpriv.IsSNOSafe(oc) currently discards its
error (isSNO, _ := extpriv.IsSNOSafe(oc)); change it to capture the error and
handle it explicitly (e.g., isSNO, err := extpriv.IsSNOSafe(oc); if err != nil {
/* fail the test or log and return */ }) so that SNO detection failures do not
silently fall back to non-SNO timing; update the surrounding logic that uses
isSNO, and ensure any failure uses the test framework's failure/logging
mechanism (e.g., t.Fatalf, framework.Failf or klog/Logf as appropriate) instead
of ignoring the error while still setting updatingWaitTime and
updatingWaitInterval.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 08da7797-b463-4018-8e76-ae6f432da12b
📒 Files selected for processing (4)
test/extended-priv/mco_ocb.gotest/extended-priv/testdata/files/machineconfigs/master-extension-mc.yamltest/extended/image_mode_status_reporting.gotest/extended/machineconfignode.go
| o.Eventually(func() string { | ||
| logs, _ := mOSBuilder.Logs() | ||
| return logs | ||
| }, "5m", "10s").Should(o.And( |
There was a problem hiding this comment.
Handle mOSBuilder.Logs() errors instead of discarding them.
The callback currently ignores the error return (logs, _ := ...), which can hide transient/fatal log-fetch failures and make this assertion flaky.
Proposed fix
- o.Eventually(func() string {
- logs, _ := mOSBuilder.Logs()
- return logs
+ o.Eventually(func() (string, error) {
+ return mOSBuilder.Logs()
}, "5m", "10s").Should(o.And(
o.MatchRegexp("(?i)"+regexp.QuoteMeta("attempting to acquire leader lease")),
o.MatchRegexp("(?i)"+regexp.QuoteMeta("successfully acquired lease"))),
"The machine os builder pod is not using the leader election without failures")As per coding guidelines: **/*.go: “Never ignore error returns”.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| o.Eventually(func() string { | |
| logs, _ := mOSBuilder.Logs() | |
| return logs | |
| }, "5m", "10s").Should(o.And( | |
| o.Eventually(func() (string, error) { | |
| return mOSBuilder.Logs() | |
| }, "5m", "10s").Should(o.And( |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/extended-priv/mco_ocb.go` around lines 314 - 317, The callback passed to
o.Eventually currently discards the error from mOSBuilder.Logs(); change it to
capture the error (logs, err := mOSBuilder.Logs()) and handle it instead of
using the blank identifier: if err != nil, return a descriptive string
containing the error (or otherwise surface the error to the test/failer) so the
Eventually assertion sees failures to fetch logs; update the callback around
mOSBuilder.Logs() used inside o.Eventually(...) to propagate or report the error
rather than ignoring it.
| func validateMCPMachineCountTransitions(machineConfigClient *machineconfigclient.Clientset, oc *exutil.CLI, mcpName string, startTime metav1.Time, mosc *extpriv.MachineOSConfig, layered bool) { | ||
| timeout := 15 * time.Minute | ||
| interval := 10 * time.Second | ||
| loopCount := 3 | ||
| // For on-cluster image mode updates, set the timeout to be longer | ||
| if layered { | ||
| logger.Infof("Layered update, setting longer update timeout.") | ||
| timeout = 45 * time.Minute | ||
| interval = 30 * time.Second | ||
| // SNO clusters require a longer time to reconcile due to the MCC restart in | ||
| // reboot-required updates, so allow for more retries. | ||
| if isSNO, _ := extpriv.IsSNOSafe(oc); isSNO { | ||
| logger.Infof("Cluster is SNO, setting higher retry count.") | ||
| loopCount = 10 | ||
| } | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '332,385p' test/extended/image_mode_status_reporting.go
printf '\n---\n'
sed -n '394,409p' test/extended/image_mode_status_reporting.goRepository: openshift/machine-config-operator
Length of output: 2017
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Confirm the caller(s) of validateMCPMachineCountTransitions
rg -n "validateMCPMachineCountTransitions" test/extended/image_mode_status_reporting.go
# Show the runMachineCountTest implementation area that calls it
rg -n "func runMachineCountTest" -n test/extended/image_mode_status_reporting.go
sed -n '300,520p' test/extended/image_mode_status_reporting.go
# Re-print the exact block referenced in the review
sed -n '380,430p' test/extended/image_mode_status_reporting.goRepository: openshift/machine-config-operator
Length of output: 1988
Handle IsSNOSafe error and verify SNO retry-budget scope
validateMCPMachineCountTransitionsdiscards the error fromextpriv.IsSNOSafe(oc)(isSNO, _ := ...), violating the Go security rule “Never ignore error returns”; handle the error explicitly and decide the fallback behavior forloopCount.- The SNO
loopCount = 10bump is only reachable insideif layered, so SNO clusters will only get the higher retry budget for layered updates—confirm this is intended and extend the logic to the non-layered path if not.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/extended/image_mode_status_reporting.go` around lines 394 - 409, In
validateMCPMachineCountTransitions, stop ignoring the error from
extpriv.IsSNOSafe(oc): capture the returned error, log or return it (or set a
safe default for loopCount) and decide fallback behavior when IsSNOSafe fails;
also ensure the SNO retry-budget bump (loopCount = 10) is applied where
intended—either move the IsSNOSafe check out of the layered-only block or
duplicate its handling for the non-layered path so SNO clusters get the higher
retry count in both cases if that’s the desired behavior.
| isSNO, _ := extpriv.IsSNOSafe(oc) | ||
| if isImageMode { | ||
| updatingWaitTime = 25 * time.Minute | ||
| updatingWaitInterval = 5 * time.Second | ||
| } else if isSNO { | ||
| // SNO transition times can also be a bit longer due to cluster connection instability, so | ||
| // set the times for longer in such clusters. (Longer image updates timing takes precedence.) | ||
| updatingWaitTime = 5 * time.Minute | ||
| updatingWaitInterval = 5 * time.Second |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n --type=go 'IsSNOSafe\(oc\)' test/extended/machineconfignode.go test/extended/image_mode_status_reporting.goRepository: openshift/machine-config-operator
Length of output: 2017
Don’t discard extpriv.IsSNOSafe(oc) errors in test/extended/machineconfignode.go (lines ~208-216)
isSNO, _ := extpriv.IsSNOSafe(oc) swallows the error, so SNO detection failures can silently switch to the non-SNO timing/assertion path. Handle the returned error explicitly (don’t use _).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/extended/machineconfignode.go` around lines 208 - 216, The call to
extpriv.IsSNOSafe(oc) currently discards its error (isSNO, _ :=
extpriv.IsSNOSafe(oc)); change it to capture the error and handle it explicitly
(e.g., isSNO, err := extpriv.IsSNOSafe(oc); if err != nil { /* fail the test or
log and return */ }) so that SNO detection failures do not silently fall back to
non-SNO timing; update the surrounding logic that uses isSNO, and ensure any
failure uses the test framework's failure/logging mechanism (e.g., t.Fatalf,
framework.Failf or klog/Logf as appropriate) instead of ignoring the error while
still setting updatingWaitTime and updatingWaitInterval.
| logger.Infof("Waiting for UpdateExecuted=True") | ||
| conditionMet, err = waitForMCNConditionStatus(machineConfigClient, updatingNodeName, mcfgv1.MachineConfigNodeUpdateExecuted, metav1.ConditionTrue, 20*time.Second, 1*time.Second) | ||
| o.Expect(err).NotTo(o.HaveOccurred(), fmt.Sprintf("Error occurred while waiting for UpdateExecuted=True: %v", err)) | ||
| o.Expect(conditionMet).To(o.BeTrue(), "Error, could not detect UpdateExecuted=True.") | ||
| // A node update in SNO can quickly transition through the "AppliedFiles" state and cause | ||
| // us to catch the update during the node reboot. During reboot, we will get connection | ||
| // errors and, thus, should not error out on such errors. | ||
| if isSNO && err != nil && errors.Is(err, syscall.ECONNREFUSED) { | ||
| logger.Infof("Warning, got connection error detecting UpdateExecuted=True.") | ||
| } else { | ||
| o.Expect(err).NotTo(o.HaveOccurred(), fmt.Sprintf("Error occurred while waiting for UpdateExecuted=True: %v", err)) | ||
| } | ||
| if !conditionMet { | ||
| logger.Infof("Warning, could not detect UpdateExecuted=True.") | ||
| } | ||
| // o.Expect(conditionMet).To(o.BeTrue(), "Error, could not detect UpdateExecuted=True.") |
There was a problem hiding this comment.
UpdateExecuted=True became best-effort for every environment.
The final Expect(conditionMet).To(BeTrue()) is removed unconditionally, so non-SNO and non-connection-error paths can now miss this transition without failing. That weakens the coverage outside the new SNO exception.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/extended/machineconfignode.go` around lines 302 - 315, The code
currently skips asserting UpdateExecuted=True unconditionally; restore a failing
assertion for non-SNO/non-connection-error paths: after calling
waitForMCNConditionStatus (function waitForMCNConditionStatus, vars conditionMet
and err, checking isSNO and errors.Is(err, syscall.ECONNREFUSED)), keep the
special-case that logs and tolerates the ECONNREFUSED for SNO, but in the else
branch ensure you both assert no error (o.Expect(err).NotTo(HaveOccurred())) and
assert conditionMet is true (o.Expect(conditionMet).To(o.BeTrue(),
fmt.Sprintf("Error, could not detect %s on node %s: %v",
mcfgv1.MachineConfigNodeUpdateExecuted, updatingNodeName, err))). This
guarantees non-SNO and non-conn-error runs fail if UpdateExecuted was not
observed.
- What I did
- How to verify it
- Description for the changelog
Summary by CodeRabbit