MCO-2167: Migrate alert test cases from the private repository#5781

Open
sergiordlr wants to merge 2 commits into openshift:main from sergiordlr:migrate_alerts_test_cases

Conversation

@sergiordlr
Contributor

@sergiordlr sergiordlr commented Mar 19, 2026

- What I did
Migrate the MCO alerts test cases from the private repository

Remove test case "[PolarionID:69184][OTP] Enable feature gate MachineConfigNodes [apigroup:machineconfiguration.openshift.io]" since MachineConfigNodes featuregate doesn't exist anymore.

- How to verify it
All tests must pass

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 19, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 19, 2026

@sergiordlr: This pull request references MCO-2167 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

- What I did
Migrate the MCO alerts test cases from the private repository

- How to verify it
All tests must pass

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Mar 19, 2026

Walkthrough

Adds test infra for image layering and MCO alert validation: new constants and repo/tag helpers, an in-node OS image builder, node disruption/version helpers, Prometheus/Thanos monitoring utilities and alert filtering, a disruptive MCO alert suite, a helper script, and small pod/MCP utilities.

Changes

Cohort / File(s) Summary
Configuration & Constants
test/extended-priv/const.go
Added exported env var EnvVarLayeringTestImageRepository, exported LayeringBaseImageReleaseInfoRhel10, and internal constants for temporary namespace and registry-admin SA naming.
Image build helpers & repo/tag logic
test/extended-priv/containers.go
Added prepareDockerfileDirectory, exported GetLayeringBaseImageByStream, getLayeringTestImageRepository, and generateUniqueTag for repository selection and deterministic tag generation.
In-node image builder
test/extended-priv/inside_node_containers.go
Added OsImageBuilderInNode and CreateAndDigestOsImage implementing environment prep, optional internal-registry auth, podman build/push (manifest vs tag), cleanup, and digest computation.
Node manipulation & version helpers
test/extended-priv/node.go
Reordered NodeList method; added BreakRebootInNode, FixRebootInNode, BreakRebaseInNode, FixRebaseInNode, and (*Node) GetRHCOSVersion() (adds github.com/tidwall/gjson import).
MCO alert test suite
test/extended-priv/mco_alerts.go
New long-running serial disruptive Ginkgo suite injecting MCP/node faults and asserting Prometheus alert presence, labels/annotations, pending→firing transitions, and remediation checks.
Prometheus/Thanos monitoring utilities
test/extended-priv/util/prometheus_monitoring.go
Added Monitor/PrometheusMonitor types, instant/range query param types, NewMonitor/NewPrometheusMonitor, GetSAToken (with fallback), query/rules/alerts retrieval, and InstantQueryWithRetry.
Alert filtering helper
test/extended-priv/util.go
Added getAlertsByName to fetch alerts via PrometheusMonitor, validate JSON, filter by labels.alertname, and return []map[string]interface{}.
Remote pod shell wrapper
test/extended-priv/util/pods.go
Added exported RemoteShPod(oc *CLI, namespace, podName string, cmd ...string) delegating to remoteShPod with fixed flags.
MachineConfigPool helper
test/extended-priv/machineconfigpool.go
Added GetCoreOsCompatiblePool(oc *exutil.CLI) *MachineConfigPool to prefer worker MCP when CoreOS nodes exist, otherwise fall back to master MCP.
Test helper script
test/extended-priv/testdata/files/rpm-ostree-force-pivot-error.sh
Added script that forces an rpm-ostree rebase failure when invoked with rebase, otherwise proxies to /tmp/rpm-ostree.
Removed test case
test/extended-priv/mco_machineconfignode.go
Removed a Ginkgo It test that asserted featuregate presence for MachineConfigNodes.
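The pool-selection fallback described for GetCoreOsCompatiblePool can be sketched in isolation. The `pool` type below is a hypothetical stand-in for the real `*MachineConfigPool`; the actual helper takes an `*exutil.CLI` and inspects node labels.

```go
package main

import "fmt"

// pool is a hypothetical stand-in for *MachineConfigPool.
type pool struct {
	name        string
	coreOSNodes int
}

// getCoreOsCompatiblePool mirrors the described fallback: prefer the worker
// pool when it contains CoreOS nodes, otherwise fall back to the master
// pool, whose nodes are always CoreOS on OpenShift.
func getCoreOsCompatiblePool(worker, master pool) pool {
	if worker.coreOSNodes > 0 {
		return worker
	}
	return master
}

func main() {
	fmt.Println(getCoreOsCompatiblePool(pool{"worker", 3}, pool{"master", 3}).name)
	fmt.Println(getCoreOsCompatiblePool(pool{"worker", 0}, pool{"master", 3}).name)
}
```

This keeps disruptive specs off non-CoreOS worker pools while still exercising the alert path on compact clusters.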

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


@sergiordlr sergiordlr force-pushed the migrate_alerts_test_cases branch from 0e3c210 to fb1747e on March 19, 2026 at 11:30
@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sergiordlr
Once this PR has been reviewed and has the lgtm label, please assign rishabhsaini for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (3)
test/extended-priv/mco_alerts.go (1)

294-303: Goroutine error handling may not fail the test properly.

The goroutine uses o.Expect(err).NotTo(o.HaveOccurred()) but if this assertion fails, the main test may have already moved past the goroutine's execution context. Consider using a channel or different synchronization to propagate errors reliably.

Additionally, the hardcoded 5-minute sleep (sleep 300) should align with the 6-minute timeout on line 325.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended-priv/mco_alerts.go` around lines 294 - 303, The anonymous
goroutine that calls node.DebugNodeWithChroot currently asserts with o.Expect
inside the goroutine and uses a hardcoded "sleep 300"; change it to propagate
success/error back to the test via a channel instead of calling o.Expect inside
the goroutine and make the sleep duration configurable to match the test
timeout. Create an errCh (chan error) and have the goroutine send the returned
err (or nil on success) after it runs node.DebugNodeWithChroot and restarts
kubelet (and set fixed = true on success); in the main test select on errCh with
a timeout that matches the test's 6-minute timeout (or define a shared constant
like kubeletRestartDelay and use it both in the chroot command and the select)
so failures in node.DebugNodeWithChroot (and the restart) reliably fail the
test.
test/extended-priv/inside_node_containers.go (1)

346-361: Consider cleanup on failure to prevent disk space leaks.

If pushImage() fails, the locally built image remains on the node. Over multiple failed test runs, this could accumulate and consume node disk space.

♻️ Suggested approach using defer
 func (b *OsImageBuilderInNode) CreateAndDigestOsImage() (string, error) {
 	if err := b.prepareEnvironment(); err != nil {
 		return "", err
 	}
 	if err := b.buildImage(); err != nil {
 		return "", err
 	}
+	// Ensure local image is cleaned up regardless of push/digest outcome
+	defer func() {
+		if rmErr := b.removeImage(); rmErr != nil {
+			logger.Warnf("Failed to remove local image: %v", rmErr)
+		}
+	}()
 	if err := b.pushImage(); err != nil {
 		return "", err
 	}
-	if err := b.removeImage(); err != nil {
-		return "", err
-	}
 	return b.digestImage()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended-priv/inside_node_containers.go` around lines 346 - 361, The
CreateAndDigestOsImage flow can leak the local image if pushImage() fails; after
a successful buildImage() call register a deferred cleanup by calling
removeImage() (e.g., defer b.removeImage()) so the locally built image is
removed on any subsequent error, then call pushImage() and finally
digestImage(); ensure the deferred removeImage() only gets registered after
buildImage() succeeds and propagate errors from pushImage(), digestImage(), and
removeImage() appropriately.
test/extended-priv/util/prometheus_monitoring.go (1)

189-197: Inconsistent command construction pattern.

GetAlerts constructs the curl command as a shell string with sh -c, while other methods (InstantQuery, RangeQuery, queryRules) use argument arrays passed directly to RemoteShPod. This inconsistency reduces maintainability and makes reasoning about the code harder.

Consider aligning with the array-based pattern used elsewhere:

♻️ Suggested refactor
 func (pmo *PrometheusMonitor) GetAlerts() (string, error) {
 
 	// We don't want to print the token
 	pmo.ocClient.NotShowInfo()
 	defer pmo.ocClient.SetShowInfo()
 
-	getCmd := "curl -k -s -H \"" + fmt.Sprintf("Authorization: Bearer %v", pmo.Token) + "\" " + pmo.url + monitorAlerts
-	return RemoteShPod(pmo.ocClient, monitorNamespace, "statefulsets/"+prometheusK8s, "sh", "-c", getCmd)
+	queryArgs := []string{"curl", "-k", "-s", "-H", fmt.Sprintf("Authorization: Bearer %v", pmo.Token), pmo.url + monitorAlerts}
+	return RemoteShPod(pmo.ocClient, monitorNamespace, "statefulsets/"+prometheusK8s, queryArgs...)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended-priv/util/prometheus_monitoring.go` around lines 189 - 197,
GetAlerts currently builds a single shell string and invokes RemoteShPod via "sh
-c", which is inconsistent with InstantQuery/RangeQuery/queryRules; change
GetAlerts to construct an argument slice (e.g. "curl", "-k", "-s", "-H",
fmt.Sprintf("Authorization: Bearer %v", pmo.Token), pmo.url+monitorAlerts) and
call RemoteShPod with those args directly (preserving the
pmo.ocClient.NotShowInfo()/defer SetShowInfo()), referencing
PrometheusMonitor.GetAlerts, RemoteShPod, pmo.Token, monitorAlerts,
monitorNamespace and prometheusK8s.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended-priv/inside_node_containers.go`:
- Line 172: The test currently logs the service account token via
logger.Debugf("SA TOKEN: %s", saToken), which exposes credentials; remove this
debug print or replace it with a non-sensitive message (e.g., log that the token
was obtained without printing its value) or explicitly mask/redact saToken
before logging; update the call site that references logger.Debugf and the
saToken local variable in inside_node_containers.go to avoid emitting the raw
token.

In `@test/extended-priv/mco_alerts.go`:
- Line 501: Fix the typo in the log messages that call
exutil.By(fmt.Sprintf(...)) where the string uses "Verfiy" — change it to
"Verify" in each occurrence (the exutil.By calls constructing messages with
fmt.Sprintf and params.stillPresentDuration and similar messages around the
alert checks); search for exutil.By(fmt.Sprintf("Verfiy")...) and replace
"Verfiy" with "Verify" in those messages so all log output is spelled correctly.
- Around line 434-436: The unsafe type assertions on alertJSON elements (e.g.,
alert := alertJSON[0]; annotationsMap :=
alert["annotations"].(map[string]interface{}) and similarly for labelsMap) can
panic if keys are missing or types differ; update these to use the comma-ok form
(v, ok := alert["annotations"]; annotationsMap, ok :=
v.(map[string]interface{})) and check ok (and presence) before proceeding, and
if the check fails call the test failure helper (e.g., t.Fatalf or
t.Fatalf-equivalent) with a clear message; apply the same defensive pattern to
the labelsMap extraction so the test fails gracefully instead of panicking.
- Around line 510-521: The function checkFixedAlert calls mcp.GetName() and runs
MCP-specific assertions even when mcp can be nil, causing a nil-pointer panic;
add a nil-check for the mcp parameter at the top of checkFixedAlert and skip or
short-circuit the MCP-specific block (the exutil.By/Eventually that calls
BeDegraded and the logger.Infof referencing mcp.GetName()) when mcp == nil,
while still running the alert existence check (the getAlertsByName/WithArguments
block) as before; ensure any log messages are adjusted so they don’t dereference
mcp and keep references to the identifiers checkFixedAlert, mcp.GetName,
BeDegraded, and getAlertsByName to locate the changes.

In `@test/extended-priv/node.go`:
- Around line 1395-1406: The CopyFromLocal call in BreakRebaseInNode ignores its
returned error, so if the file copy fails subsequent operations (involving
/tmp/rpm-ostree.broken and node.DebugNodeWithChroot) will run against a missing
file; update BreakRebaseInNode to capture and check the error from
node.CopyFromLocal (the brokenRpmOstree result from
generateTemplateAbsolutePath) and return it immediately (or wrap it with
context) before calling node.DebugNodeWithChroot, ensuring failures from
CopyFromLocal are surfaced.

In `@test/extended-priv/testdata/files/rpm-ostree-force-pivot-error.sh`:
- Around line 7-9: The script uses an invalid exit code and unquoted positional
parameters: replace the invalid `exit -1` with a valid exit status like `exit 1`
and change the invocation `/tmp/rpm-ostree $@` to quote the positional
parameters as `/tmp/rpm-ostree "$@"` so argument boundaries are preserved;
update these occurrences in the script (look for the `exit -1` token and the
`/tmp/rpm-ostree $@` invocation) to apply the fix.
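Both fixes the last prompt describes can be seen in a small runnable sketch. `fake_rpm_ostree` is a stand-in for the wrapper script; the real script delegates to /tmp/rpm-ostree.

```shell
#!/bin/sh
# Stand-in for the wrapper: fail the "rebase" verb with a *valid* status
# (1, not -1 -- an exit status must be in 0-255) and forward every other
# call with argument boundaries preserved via "$@".
fake_rpm_ostree() {
    if [ "$1" = "rebase" ]; then
        return 1
    fi
    printf '%s\n' "$@"   # "$@" keeps "some arg" as a single argument
}

fake_rpm_ostree status "some arg"
fake_rpm_ostree rebase || echo "rebase failed as intended"
```

With unquoted `$@`, "some arg" would be split into two arguments before reaching the delegated binary.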


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: de41ae65-949b-4bf6-a994-0efdf61054f6

📥 Commits

Reviewing files that changed from the base of the PR and between 455e155 and fb1747e.

📒 Files selected for processing (10)
  • test/extended-priv/const.go
  • test/extended-priv/containers.go
  • test/extended-priv/inside_node_containers.go
  • test/extended-priv/machineconfigpool.go
  • test/extended-priv/mco_alerts.go
  • test/extended-priv/node.go
  • test/extended-priv/testdata/files/rpm-ostree-force-pivot-error.sh
  • test/extended-priv/util.go
  • test/extended-priv/util/pods.go
  • test/extended-priv/util/prometheus_monitoring.go

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended-priv/mco_alerts.go`:
- Around line 279-280: Remove the MCP recovery check invoked by
checkFixedAlert(oc, coMcp, expectedAlertName) in these two specs (the calls at
the shown location and the similar one around lines 329-330); instead, either
delete the call or replace it with a check that only asserts the alert is no
longer firing (e.g., a helper that uses oc and expectedAlertName to verify the
alert is cleared) so you do not reintroduce unrelated MCP state assertions via
coMcp.
- Around line 362-367: The test is reading the RHCOS version from mcp which may
point to a non-CoreOS worker pool; change the node source to the
compact/coreOS-compatible MCP by using coMcp instead of mcp when calling
GetSortedNodesOrFail(), then call GetRHCOSVersion() on that node and keep the
existing error check and logger; update references of mcp.GetSortedNodesOrFail()
to coMcp.GetSortedNodesOrFail() (keeping the GetRHCOSVersion, o.Expect and
logger.Infof usage).
- Around line 478-502: Replace the fixed time.Sleep-based check with a polling
assertion that waits for the alert to transition to "firing": stop using
time.Sleep(params.pendingDuration) and the separate
o.Eventually(...).Should(o.HaveLen(1)) check; instead, use o.Eventually with a
function that calls getAlertsByName(oc, params.expectedAlertName), returns the
fetched alerts (or an error) and assert both that len(alerts) == 1 and that
alerts[0]["state"] == "firing" (or use o.HaveKeyWithValue on the returned
alert), so the test retries until the Prometheus evaluation actually sets state
to "firing" rather than relying on a single sleep. Ensure the code references
getAlertsByName, params.pendingDuration (remove its sleep usage),
alertJSON/alertErr, and the o.Eventually assertion to perform the combined
length+state check.

In `@test/extended-priv/node.go`:
- Around line 1402-1406: The shell pipeline passed to node.DebugNodeWithChroot
should fail fast so earlier failures (chmod/cp/nsenter) don't get masked by a
later successful command; change the "sh -c" invocation to use "sh -e -c" and
replace the semicolon-separated commands with && chaining so any command error
aborts the chain (update the command string used in node.DebugNodeWithChroot to
add -e and use && between "chmod +x /tmp/rpm-ostree.broken", "cp
/usr/bin/rpm-ostree /tmp/rpm-ostree", "nsenter --mount=/proc/1/ns/mnt mount
--bind /tmp/rpm-ostree.broken /usr/bin/rpm-ostree", and "restorecon -v
/usr/bin/rpm-ostree").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e132a5ae-ff61-4356-89e0-a704ae5253c0

📥 Commits

Reviewing files that changed from the base of the PR and between fb1747e and 6627065.

📒 Files selected for processing (3)
  • test/extended-priv/mco_alerts.go
  • test/extended-priv/node.go
  • test/extended-priv/testdata/files/rpm-ostree-force-pivot-error.sh
✅ Files skipped from review due to trivial changes (1)
  • test/extended-priv/testdata/files/rpm-ostree-force-pivot-error.sh

Comment on lines +279 to +280
exutil.By("Check that the alert is not triggered anymore")
checkFixedAlert(oc, coMcp, expectedAlertName)

⚠️ Potential issue | 🟠 Major

Don't reintroduce unrelated MCP recovery checks in these two specs.

Both tests skip the MCP degradation assertion when the alert fires, but these checkFixedAlert calls add it back on the recovery path. That makes the specs fail on unrelated pool state instead of only on the alert under test.

Proposed fix
-		checkFixedAlert(oc, coMcp, expectedAlertName)
+		checkFixedAlert(oc, nil, expectedAlertName)
@@
-		checkFixedAlert(oc, mcp, expectedAlertName)
+		checkFixedAlert(oc, nil, expectedAlertName)

Also applies to: 329-330


g.It("[PolarionID:73841][OTP] KubeletHealthState alert [Disruptive]", func() {
var (
node = mcp.GetSortedNodesOrFail()[0]
fixed = false

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '286,326p' test/extended-priv/mco_alerts.go

Repository: openshift/machine-config-operator

Length of output: 2001


Fix data race on fixed variable using proper goroutine synchronization.

The variable fixed is written by a goroutine (line 296) and read by the main goroutine in the Eventually polling loop (line 325) without synchronization. This is a data race that will be caught by Go's race detector.

Use a channel to synchronize instead: initialize kubeletRestarted := make(chan struct{}), close it when the kubelet restarts, and check for closure in the Eventually function using a select statement.

Proposed fix
-			fixed                              = false
+			kubeletRestarted                   = make(chan struct{})
@@
-			fixed = true
+			close(kubeletRestarted)
@@
-		o.Eventually(func() bool { return fixed }, "6m", "20s").Should(o.BeTrue(), "Kubelet service was not restarted")
+		o.Eventually(func() bool {
+			select {
+			case <-kubeletRestarted:
+				return true
+			default:
+				return false
+			}
+		}, "6m", "20s").Should(o.BeTrue(), "Kubelet service was not restarted")

Comment on lines +362 to +367
exutil.By("Get the RHCOS version from a worker node for later use")
node := mcp.GetSortedNodesOrFail()[0]
rhcosVersion, err := node.GetRHCOSVersion()
o.Expect(err).NotTo(o.HaveOccurred(),
"Error getting the RHCOS version from node %s", node.GetName())
logger.Infof("RHCOS version from node %s: %s\n", node.GetName(), rhcosVersion)

⚠️ Potential issue | 🟠 Major

Read the RHCOS version from coMcp, not mcp.

GetCompactCompatiblePool() can still resolve to a worker pool without CoreOS nodes. In that topology this spec fails before it ever exercises the alert, which is exactly the case GetCoreOsCompatiblePool() was added to handle.

Proposed fix
-		exutil.By("Get the RHCOS version from a worker node for later use")
-		node := mcp.GetSortedNodesOrFail()[0]
+		exutil.By("Get the RHCOS version from a CoreOS node for later use")
+		node := coMcp.GetSortedNodesOrFail()[0]

Comment on lines +478 to +502
if params.pendingDuration != 0 {
exutil.By("Verify that the alert is pending")
o.Expect(alert).To(o.HaveKeyWithValue("state", "pending"),
"Expected the alert's state to be 'pending', but it is not.")
logger.Infof("OK!\n")
}

exutil.By("Verify that the alert is in firing state")
if params.pendingDuration != 0 {
logger.Infof("Wait %s minutes until the alert is fired", params.pendingDuration)
time.Sleep(params.pendingDuration)
}

logger.Infof("Checking alert's state")
o.Eventually(func() ([]map[string]interface{}, error) {
alertJSON, alertErr = getAlertsByName(oc, params.expectedAlertName)
return alertJSON, alertErr
}, "15m", "20s").Should(o.HaveLen(1),
"Expected 1 %s alert and only 1 to be triggered!", params.expectedAlertName)

logger.Infof("Found %s alerts: %v", params.expectedAlertName, alertJSON)

alert = alertJSON[0]
o.Expect(alert).To(o.HaveKeyWithValue("state", "firing"),
"Expected the alert to report 'firing' state")

⚠️ Potential issue | 🟠 Major

Wait for the firing transition, not just alert presence after a fixed sleep.

time.Sleep(params.pendingDuration) assumes the next Prometheus evaluation has already happened. If the alert is still pending on the next scrape, HaveLen(1) passes and the following state assertion flakes.

Proposed fix
-	if params.pendingDuration != 0 {
-		logger.Infof("Wait %s minutes until the alert is fired", params.pendingDuration)
-		time.Sleep(params.pendingDuration)
-	}
-
 	logger.Infof("Checking alert's state")
-	o.Eventually(func() ([]map[string]interface{}, error) {
+	o.Eventually(func() (string, error) {
 		alertJSON, alertErr = getAlertsByName(oc, params.expectedAlertName)
-		return alertJSON, alertErr
-	}, "15m", "20s").Should(o.HaveLen(1),
-		"Expected 1 %s alert and only 1 to be triggered!", params.expectedAlertName)
-
-	logger.Infof("Found %s alerts: %v", params.expectedAlertName, alertJSON)
-
-	alert = alertJSON[0]
-	o.Expect(alert).To(o.HaveKeyWithValue("state", "firing"),
+		if alertErr != nil {
+			return "", alertErr
+		}
+		if len(alertJSON) != 1 {
+			return "", fmt.Errorf("expected 1 %s alert, got %d", params.expectedAlertName, len(alertJSON))
+		}
+		state, ok := alertJSON[0]["state"].(string)
+		if !ok {
+			return "", fmt.Errorf("alert state is missing or not a string")
+		}
+		return state, nil
+	}, "15m", "20s").Should(o.Equal("firing"),
 		"Expected the alert to report 'firing' state")
-
-	logger.Infof("OK!\n")

Comment on lines +1402 to +1406
_, err := node.DebugNodeWithChroot("sh", "-c",
"chmod +x /tmp/rpm-ostree.broken; "+
"cp /usr/bin/rpm-ostree /tmp/rpm-ostree; "+
"nsenter --mount=/proc/1/ns/mnt mount --bind /tmp/rpm-ostree.broken /usr/bin/rpm-ostree; "+
"restorecon -v /usr/bin/rpm-ostree")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1396,1406p' test/extended-priv/node.go
sed -n '1,11p' test/extended-priv/testdata/files/rpm-ostree-force-pivot-error.sh

Repository: openshift/machine-config-operator

Length of output: 1027


Use -e flag and && chaining to fail fast on the rpm-ostree setup chain.

The current sh -c with semicolons only reports the status of restorecon, so a failed chmod or cp can still report success. This is risky because the injected wrapper delegates every non-rebase call to /tmp/rpm-ostree; if that backup copy was never created, the node ends up more broken than the test intends.

Proposed fix
-	_, err := node.DebugNodeWithChroot("sh", "-c",
-		"chmod +x /tmp/rpm-ostree.broken; "+
-			"cp /usr/bin/rpm-ostree /tmp/rpm-ostree; "+
-			"nsenter --mount=/proc/1/ns/mnt mount --bind /tmp/rpm-ostree.broken /usr/bin/rpm-ostree; "+
-			"restorecon -v /usr/bin/rpm-ostree")
+	_, err := node.DebugNodeWithChroot("sh", "-ec",
+		"chmod +x /tmp/rpm-ostree.broken && "+
+			"cp /usr/bin/rpm-ostree /tmp/rpm-ostree && "+
+			"nsenter --mount=/proc/1/ns/mnt mount --bind /tmp/rpm-ostree.broken /usr/bin/rpm-ostree && "+
+			"restorecon -v /usr/bin/rpm-ostree")

@sergiordlr sergiordlr force-pushed the migrate_alerts_test_cases branch from 6627065 to 10b0ea2 Compare March 20, 2026 09:36
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 20, 2026

@sergiordlr: This pull request references MCO-2167 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

- What I did
Migrate the MCO alerts test cases from the private repository

Remove test case "[PolarionID:69184][OTP] Enable feature gate MachineConfigNodes [apigroup:machineconfiguration.openshift.io]" since MachineConfigNodes featuregate doesn't exist anymore.

- How to verify it
All tests must pass

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (5)
test/extended-priv/mco_alerts.go (4)

279-280: ⚠️ Potential issue | 🟠 Major

Keep these specs alert-scoped on recovery too.

Both tests intentionally skip MCP assertions when the alert fires. Passing coMcp/mcp here reintroduces unrelated pool-state failures on the recovery path.

Suggested fix
-		checkFixedAlert(oc, coMcp, expectedAlertName)
+		checkFixedAlert(oc, nil, expectedAlertName)
@@
-		checkFixedAlert(oc, mcp, expectedAlertName)
+		checkFixedAlert(oc, nil, expectedAlertName)

Also applies to: 329-330

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended-priv/mco_alerts.go` around lines 279 - 280, The call to
checkFixedAlert is passing coMcp/mcp and thus reintroduces unrelated MCP
assertions on recovery; change the calls (e.g. checkFixedAlert(oc, coMcp,
expectedAlertName)) to the alert-scoped form that omits the MCP argument (e.g.
checkFixedAlert(oc, expectedAlertName)) so recovery checks stay alert-scoped;
make the same change at the other occurrence around lines 329-330.

286-302: ⚠️ Potential issue | 🔴 Critical

Synchronize the kubelet restart flag.

fixed is written in a goroutine and polled from the main goroutine without synchronization, so this spec has a real data race and can flake under -race.

Suggested fix
-			fixed                              = false
+			kubeletRestarted                   = make(chan struct{})
@@
-			fixed = true
+			close(kubeletRestarted)
@@
-		o.Eventually(func() bool { return fixed }, "6m", "20s").Should(o.BeTrue(), "Kubelet service was not restarted")
+		o.Eventually(func() bool {
+			select {
+			case <-kubeletRestarted:
+				return true
+			default:
+				return false
+			}
+		}, "6m", "20s").Should(o.BeTrue(), "Kubelet service was not restarted")

Also applies to: 324-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended-priv/mco_alerts.go` around lines 286 - 302, The boolean
variable "fixed" is written inside the goroutine that calls
node.DebugNodeWithChroot and read from the main goroutine, causing a data race;
fix by synchronizing: replace the shared "fixed" flag with a concurrency-safe
mechanism such as an atomic boolean (e.g., atomic.StoreUint32/atomic.LoadUint32)
or a channel/WaitGroup to signal completion from the goroutine, update the
goroutine to signal (instead of setting "fixed" directly) and change the main
goroutine polling to use atomic.Load or receive from the channel; update the
occurrences around the goroutine and the reads of "fixed" (including the similar
occurrences at lines referenced as 324-325) to use the chosen synchronization
primitive.

485-502: ⚠️ Potential issue | 🟠 Major

Wait for the firing transition, not just alert presence.

Eventually(...).Should(o.HaveLen(1)) exits as soon as the alert exists. If the rule is still pending on that scrape, the next state == "firing" assertion is racy.

Suggested fix
 	exutil.By("Verify that the alert is in firing state")
-	if params.pendingDuration != 0 {
-		logger.Infof("Wait %s minutes until the alert is fired", params.pendingDuration)
-		time.Sleep(params.pendingDuration)
-	}
-
 	logger.Infof("Checking alert's state")
-	o.Eventually(func() ([]map[string]interface{}, error) {
+	o.Eventually(func() (string, error) {
 		alertJSON, alertErr = getAlertsByName(oc, params.expectedAlertName)
-		return alertJSON, alertErr
-	}, "15m", "20s").Should(o.HaveLen(1),
-		"Expected 1 %s alert and only 1 to be triggered!", params.expectedAlertName)
-
-	logger.Infof("Found %s alerts: %v", params.expectedAlertName, alertJSON)
-
-	alert = alertJSON[0]
-	o.Expect(alert).To(o.HaveKeyWithValue("state", "firing"),
-		"Expected the alert to report 'firing' state")
+		if alertErr != nil {
+			return "", alertErr
+		}
+		if len(alertJSON) != 1 {
+			return "", fmt.Errorf("expected 1 %s alert, got %d", params.expectedAlertName, len(alertJSON))
+		}
+		state, ok := alertJSON[0]["state"].(string)
+		if !ok {
+			return "", fmt.Errorf("alert state is missing or not a string")
+		}
+		return state, nil
+	}, "15m", "20s").Should(o.Equal("firing"),
+		"Expected the alert to report 'firing' state")
 
 	logger.Infof("OK!\n")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended-priv/mco_alerts.go` around lines 485 - 502, The test currently
waits only for presence via o.Eventually(...).Should(o.HaveLen(1)) which races
with a pending alert; change the check to wait for the alert's state to become
"firing" by polling getAlertsByName(oc, params.expectedAlertName) and returning
the alert's state (or an error) from the Eventually closure, then assert
Should(o.Equal("firing")) (or Should(o.ContainSubstring("firing")) if you return
a string) instead of relying on HaveLen; reference the existing getAlertsByName
call and the alert/state assertion flow (alertJSON, alert,
params.expectedAlertName) so the test only proceeds when state == "firing".

362-364: ⚠️ Potential issue | 🟠 Major

Use the CoreOS-compatible pool for GetRHCOSVersion().

GetCompactCompatiblePool() can still resolve to a worker pool with non-CoreOS nodes. In that topology this fails before the alert behavior is exercised.

Suggested fix
-		exutil.By("Get the RHCOS version from a worker node for later use")
-		node := mcp.GetSortedNodesOrFail()[0]
+		exutil.By("Get the RHCOS version from a CoreOS node for later use")
+		node := coMcp.GetSortedNodesOrFail()[0]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended-priv/mco_alerts.go` around lines 362 - 364, The code currently
picks a node via mcp.GetSortedNodesOrFail()[0] which can return non-CoreOS
workers; replace that with the CoreOS-compatible pool accessor so
GetRHCOSVersion() runs on a CoreOS node. Specifically, call
mcp.GetCoreOSCompatiblePool().GetSortedNodesOrFail()[0] (or the project’s
equivalent CoreOS-compatible pool method) and then invoke node.GetRHCOSVersion()
on that node.
test/extended-priv/node.go (1)

1402-1406: ⚠️ Potential issue | 🟠 Major

Fail fast when wiring the broken rpm-ostree wrapper.

With sh -c and ;, a failed chmod, cp, or mount can be masked by a later successful restorecon. That can leave /usr/bin/rpm-ostree bound to a wrapper that delegates to a missing /tmp/rpm-ostree.

Suggested fix
-	_, err := node.DebugNodeWithChroot("sh", "-c",
-		"chmod +x /tmp/rpm-ostree.broken; "+
-			"cp /usr/bin/rpm-ostree /tmp/rpm-ostree; "+
-			"nsenter --mount=/proc/1/ns/mnt mount --bind /tmp/rpm-ostree.broken /usr/bin/rpm-ostree; "+
-			"restorecon -v /usr/bin/rpm-ostree")
+	_, err := node.DebugNodeWithChroot("sh", "-ec",
+		"chmod +x /tmp/rpm-ostree.broken && "+
+			"cp /usr/bin/rpm-ostree /tmp/rpm-ostree && "+
+			"nsenter --mount=/proc/1/ns/mnt mount --bind /tmp/rpm-ostree.broken /usr/bin/rpm-ostree && "+
+			"restorecon -v /usr/bin/rpm-ostree")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended-priv/node.go` around lines 1402 - 1406, The command string
passed to DebugNodeWithChroot currently chains operations with ';' so a later
success (restorecon) can hide earlier failures; update the invocation of
DebugNodeWithChroot to make the shell fail fast—either prefix the command with
"set -euo pipefail;" or replace ';' with '&&' between the steps (the sequence in
DebugNodeWithChroot: "chmod +x /tmp/rpm-ostree.broken", "cp /usr/bin/rpm-ostree
/tmp/rpm-ostree", "nsenter --mount=/proc/1/ns/mnt mount --bind
/tmp/rpm-ostree.broken /usr/bin/rpm-ostree", "restorecon -v
/usr/bin/rpm-ostree") so any failed step aborts immediately and prevents leaving
/usr/bin/rpm-ostree bound to a non-existent target.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 629f9b19-1856-4600-bc20-96ee86a64e1b

📥 Commits

Reviewing files that changed from the base of the PR and between 6627065 and 10b0ea2.

📒 Files selected for processing (4)
  • test/extended-priv/mco_alerts.go
  • test/extended-priv/mco_machineconfignode.go
  • test/extended-priv/node.go
  • test/extended-priv/testdata/files/rpm-ostree-force-pivot-error.sh
💤 Files with no reviewable changes (1)
  • test/extended-priv/mco_machineconfignode.go
✅ Files skipped from review due to trivial changes (1)
  • test/extended-priv/testdata/files/rpm-ostree-force-pivot-error.sh

Comment on lines +294 to +303
exutil.By("Break kubelet")
// We stop the kubelet service to break the node and after 5 minutes we start it again to fix the node
go func() {
	defer g.GinkgoRecover()
	_, err := node.DebugNodeWithChroot("sh", "-c", "systemctl stop kubelet.service; sleep 300; systemctl start kubelet.service")
	o.Expect(err).NotTo(o.HaveOccurred(),
		"Error stopping and restarting kubelet in %s", node)
	logger.Infof("Kubelet service has been restarted again")
	fixed = true
}()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file exists and read the specified section
head -n 310 test/extended-priv/mco_alerts.go | tail -n 30

Repository: openshift/machine-config-operator

Length of output: 1880


🏁 Script executed:

# Let's also see more context around this section
sed -n '280,320p' test/extended-priv/mco_alerts.go

Repository: openshift/machine-config-operator

Length of output: 1880


🏁 Script executed:

# Search for DeferCleanup usage in test files to understand the pattern
rg "DeferCleanup" test/ -A 2 -B 1 --max-count 10

Repository: openshift/machine-config-operator

Length of output: 1880


🏁 Script executed:

# Check if checkFiredAlert aborts/fails the test
rg "checkFiredAlert" test/extended-priv/mco_alerts.go -B 2 -A 5

Repository: openshift/machine-config-operator

Length of output: 1880


🏁 Script executed:

# Look for other kubelet or similar service disruption patterns in tests
rg "systemctl (stop|start)" test/ -B 3 -A 3

Repository: openshift/machine-config-operator

Length of output: 1880


🌐 Web query:

Ginkgo DeferCleanup pattern Go testing

💡 Result:

The DeferCleanup pattern in Ginkgo (v2+) is a cleanup mechanism for Go tests that registers dynamic cleanup functions (AfterEach/AfterAll/AfterSuite equivalents) from inside setup or subject nodes. Registered closures run in LIFO order among themselves once the spec completes, simplifying the pairing of setup with cleanup without distant AfterEach blocks.

Key benefits:
- Keeps cleanup close to setup code (e.g., in BeforeEach or It).
- Avoids closure variables for state passing.
- Supports context.Context (SpecContext) for timeouts/interrupts.
- The cleanup function may return nothing or a single error; a non-nil error fails the spec.
- Enables full testing.TB compatibility (e.g., t.Cleanup via GinkgoT().Cleanup).

Example from the official docs:

Describe("Reporting book weight", func() {
	BeforeEach(func() {
		originalWeightUnits := os.Getenv("WEIGHT_UNITS")
		os.Setenv("WEIGHT_UNITS", "oz")
		DeferCleanup(func() {
			err := os.Setenv("WEIGHT_UNITS", originalWeightUnits)
			Expect(err).NotTo(HaveOccurred())
		})
	})
	It("...", func() { /* ... */ })
})

Here, DeferCleanup restores the env var immediately after setup, close to where it is changed. Another example (file cleanup):

It("Should cleanup the created file", func() {
	file, err := os.Create("test.txt")
	Expect(err).NotTo(HaveOccurred())
	DeferCleanup(os.Remove, "test.txt")
	// Use file...
})

Cleanup closures registered with DeferCleanup run after the spec's subject and After* nodes, so mixing them with AfterEach can be surprising (DeferCleanup always runs last). Use DeferCleanup over AfterEach for most cases, especially integration tests needing resource teardown (files, servers, DBs); the docs recommend it as the cleaner, simpler option.



🌐 Web query:

Ginkgo test cleanup early abort assertions

💡 Result:

In the Ginkgo BDD testing framework for Go, when an assertion fails (via Fail or a failing Gomega Expect), Ginkgo immediately aborts the current node (It, BeforeEach, etc.) by panicking. It rescues the panic, marks the spec as failed, and proceeds to run cleanup: all registered AfterEach, JustAfterEach, and DeferCleanup functions. This ensures resources are cleaned up even on early abort.

Key mechanisms:
- Fail/Expect panics to fast-fail the node (this cannot be disabled).
- Cleanup runs reliably post-abort via panic recovery.
- Use DeferCleanup(setupFunc) right after setup for paired cleanup (Ginkgo v2+).
- For goroutines with assertions: defer GinkgoRecover() to allow proper failure propagation without a suite panic.
- Timeouts (SpecTimeout/NodeTimeout) also interrupt and trigger cleanup.

Official docs confirm: "When a failure occurs in a BeforeEach, JustBeforeEach, or It closure Ginkgo halts execution of the current spec and cleans up by invoking any registered AfterEach or JustAfterEach closures (and any registered DeferCleanup closures if applicable)."



Add failure-path cleanup for the kubelet disruption.

If checkFiredAlert or any later assertion aborts the spec early, the suite moves on while kubelet is still stopped for up to five minutes. In this serial file, that cascades into unrelated failures. Use g.DeferCleanup to ensure kubelet restarts even when an assertion fails.

Suggested fix
 		exutil.By("Break kubelet")
+		g.DeferCleanup(func() {
+			_, _ = node.DebugNodeWithChroot("systemctl", "start", "kubelet.service")
+			o.Eventually(node, "6m", "20s").Should(HaveConditionField("Ready", "status", TrueString),
+				"Node %s didn't become ready during cleanup", node)
+		})
 		// We stop the kubelet service to break the node and after 5 minutes we start it again to fix the node
 		go func() {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended-priv/mco_alerts.go` around lines 294 - 303, The goroutine that
stops kubelet via node.DebugNodeWithChroot("sh", "-c", "systemctl stop
kubelet.service; sleep 300; systemctl start kubelet.service") can leave kubelet
stopped if the test aborts early; register a g.DeferCleanup that ensures
"systemctl start kubelet.service" is invoked on the node to restore state.
Concretely, keep the stop logic in the goroutine (or call
node.DebugNodeWithChroot to stop), but immediately call g.DeferCleanup with a
closure that calls node.DebugNodeWithChroot("sh", "-c", "systemctl start
kubelet.service") (handle/log errors but don’t panic), so kubelet is restarted
even if checkFiredAlert or later assertions abort the spec; reference
node.DebugNodeWithChroot and g.DeferCleanup when making the change.

@sergiordlr sergiordlr force-pushed the migrate_alerts_test_cases branch from 10b0ea2 to 7479576 Compare March 20, 2026 13:02

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
test/extended-priv/node.go (1)

1402-1406: ⚠️ Potential issue | 🟠 Major

Use && chaining to fail fast on the setup chain.

The shell command still uses semicolons. If chmod or cp fails, execution continues—potentially leaving the node broken without a valid /tmp/rpm-ostree backup (which the injected script delegates to for non-rebase calls).

🐛 Proposed fix
 	_, err := node.DebugNodeWithChroot("sh", "-c",
-		"chmod +x /tmp/rpm-ostree.broken; "+
-			"cp /usr/bin/rpm-ostree /tmp/rpm-ostree; "+
-			"nsenter --mount=/proc/1/ns/mnt mount --bind /tmp/rpm-ostree.broken /usr/bin/rpm-ostree; "+
+		"chmod +x /tmp/rpm-ostree.broken && "+
+			"cp /usr/bin/rpm-ostree /tmp/rpm-ostree && "+
+			"nsenter --mount=/proc/1/ns/mnt mount --bind /tmp/rpm-ostree.broken /usr/bin/rpm-ostree && "+
 			"restorecon -v /usr/bin/rpm-ostree")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended-priv/node.go` around lines 1402 - 1406, The shell command
passed to node.DebugNodeWithChroot uses semicolons which allows later commands
to run even if earlier ones fail; change the command string to use && between
the commands ("chmod +x ... && cp ... && nsenter ... && restorecon ...") so the
chain fails fast on the first error, preserving the same arguments to
node.DebugNodeWithChroot and keeping the overall command quoted as a single
parameter.
🧹 Nitpick comments (2)
test/extended-priv/inside_node_containers.go (2)

297-298: Incomplete doc comment.

The comment says "by:" but doesn't list the steps. Consider completing the description or removing "by:".

📝 Suggested fix
-// removeImage removes the built OS image from the node to free up space by:
+// removeImage removes the built OS image from the node to free up space.
+// It uses "podman manifest rm" for manifest builds or "podman rmi" for single images.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended-priv/inside_node_containers.go` around lines 297 - 298, The doc
comment for the method removeImage on OsImageBuilderInNode is incomplete ("by:"
with no steps); either complete the comment by describing the specific steps the
function performs to remove the built OS image (e.g., stop containers, delete
files, free space) or remove the trailing "by:" so the comment is a concise
one-line summary; update the comment immediately above func (b
*OsImageBuilderInNode) removeImage() error and ensure it accurately reflects the
implementation.

320-321: Incomplete doc comment.

Same issue as removeImage - the comment says "by:" but doesn't complete the description.

📝 Suggested fix
-// digestImage inspects the pushed image and returns its digest reference by:
+// digestImage inspects the pushed image using skopeo and returns its digest reference
+// in the format Name@Digest.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended-priv/inside_node_containers.go` around lines 320 - 321, The doc
comment for digestImage is incomplete ("by:") similar to removeImage; update the
comment above the digestImage method of OsImageBuilderInNode to be a complete
sentence describing what the function does (e.g., "digestImage inspects the
pushed image and returns its digest reference by inspecting the image manifest
and extracting the digest") or remove the trailing "by:" so the comment reads:
"digestImage inspects the pushed image and returns its digest reference." Ensure
the comment matches the intent of the function and follows Go doc comment style
for digestImage (and mirror the fix for removeImage if needed).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bf9a33a5-b95d-47da-8051-87eef79f4639

📥 Commits

Reviewing files that changed from the base of the PR and between 10b0ea2 and 7479576.

📒 Files selected for processing (5)
  • test/extended-priv/inside_node_containers.go
  • test/extended-priv/mco_alerts.go
  • test/extended-priv/mco_machineconfignode.go
  • test/extended-priv/node.go
  • test/extended-priv/testdata/files/rpm-ostree-force-pivot-error.sh
💤 Files with no reviewable changes (1)
  • test/extended-priv/mco_machineconfignode.go
✅ Files skipped from review due to trivial changes (1)
  • test/extended-priv/testdata/files/rpm-ostree-force-pivot-error.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended-priv/mco_alerts.go

return nil
}

// removeImage removes the built OS image from the node to free up space by:
Member

This comment seems like it is missing a list of steps after by:

return nil
}

// digestImage inspects the pushed image and returns its digest reference by:
Member

This comment seems like it is missing a list of steps after by:

@openshift-ci
Contributor

openshift-ci bot commented Mar 20, 2026

@sergiordlr: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-op-ocl-part1 7479576 link false /test e2e-gcp-op-ocl-part1

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
