
NO-JIRA: refactor(tnf): separate precondition timeouts from recovery cluster health #30812

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from jaypoulz:tnf-skip-aggresively
Feb 26, 2026

Conversation

@jaypoulz
Contributor

@jaypoulz jaypoulz commented Feb 25, 2026

Updated TNF recovery tests to have a fail-fast pattern so we don't time out in disaster cases.

Precondition checks use 5m for operators and 1m for etcd; recovery/reboot keeps 15m. Helpers were updated to accept a timeout parameter.

Summary by CodeRabbit

  • Tests
    • Introduced explicit precondition timeouts (5 min for cluster, 1 min for etcd) and refactored health-check orchestration to use them.
    • Updated health-check helpers to accept explicit timeout parameters and aligned logging/timing accordingly.
    • Improved stability and reduced flakiness of cluster and etcd health validations in tests.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 25, 2026
@openshift-ci-robot

@jaypoulz: This pull request explicitly references no jira issue.

Details

In response to this:

Updated TNF recovery tests to have a fail-fast pattern so we don't time out in disaster cases.

Precondition checks use 5m for operators and 1m for etcd; recovery/reboot keeps 15m. Ensure helpers accept timeout parameter.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Feb 25, 2026

Walkthrough

Adds two precondition timeout constants and refactors health-check helpers to accept explicit timeout parameters; updates call sites to use the new timeouts instead of hard-coded durations and preserves an existing debug container timeout.

Changes

Cohort / File(s) Summary
Timeouts & Health-check helpers
test/extended/two_node/utils/common.go
Adds preconditionClusterHealthyTimeout = 5 * time.Minute and preconditionEtcdHealthyTimeout = 1 * time.Minute; retains debugContainerTimeout = 60 * time.Second. Refactors ensureClusterOperatorHealthy(oc *exutil.CLI, timeout time.Duration), ensureEtcdPodsAreRunning(oc *exutil.CLI, timeout time.Duration), and ensureEtcdHasTwoVotingMembers(nodes *corev1.NodeList, ecf *helpers.EtcdClientFactoryImpl, timeout time.Duration) to accept explicit timeouts. Updates callers (e.g., SkipIfClusterIsNotHealthy) to pass the new precondition timeouts and aligns log messages and internal timeout usage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Test Structure And Quality: ❓ Inconclusive. The file referenced in the PR summary does not exist in the repository; searches for related functions and timeout constants yield no results.
Resolution: verify the correct file path exists and is accessible, confirm the PR changes are committed to the current branch, or provide the accurate file location.
✅ Passed checks (4 passed)
Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
Title check: ✅ Passed. The title accurately reflects the main refactoring objective (separating precondition timeouts from recovery cluster health in the TNF tests), which is precisely what the changeset accomplishes.
Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required 80.00% threshold.
Stable And Deterministic Test Names: ✅ Passed. The PR modifies a utility/helper file (common.go) with timeout constants and health-check functions, not test definitions with titles. No test names are changed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 25, 2026
@openshift-ci openshift-ci bot requested review from eggfoobar and qJkee February 25, 2026 20:31

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/extended/two_node/utils/common.go (1)

1185-1205: ⚠️ Potential issue | 🟠 Major

ensureEtcdPodsAreRunning bypasses its timeout on pod-count failures.

Line 1203 returns immediately when fewer than 2 pods are running, so this path does not “wait” despite accepting a timeout. This can cause false precondition skips during brief transitions.

Proposed fix
 		} else {
 			runningPods := 0
 			for _, pod := range etcdPods.Items {
 				if pod.Status.Phase == corev1.PodRunning {
 					runningPods++
 				}
 			}
-			if runningPods < 2 {
-				return fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)
-			}
-
-			framework.Logf("SUCCESS: found the 2 expected Etcd pods")
-			return nil
+			if runningPods < 2 {
+				err = fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)
+			} else {
+				framework.Logf("SUCCESS: found the 2 expected Etcd pods")
+				return nil
+			}
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/utils/common.go` around lines 1185 - 1205, The
function ensureEtcdPodsAreRunning currently returns immediately when runningPods
< 2, bypassing the provided timeout; change the logic so pod-count failures set
a local error (e.g., err = fmt.Errorf("expected at least 2 etcd pods running,
found %d", runningPods)) and allow the for loop to continue until the context
deadline instead of returning; also use the created ctx when calling
oc.AdminKubeClient().CoreV1().Pods(...).List(ctx, ...) and after the loop
returns the most recent error if the context times out (or nil on success),
optionally adding a short sleep/backoff between retries to avoid tight looping.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between a631c60 and 5d188d0.

📒 Files selected for processing (1)
  • test/extended/two_node/utils/common.go

@jaypoulz
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Feb 25, 2026

@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e691a440-1289-11f1-936b-8aa12b8bffa6-0

@eggfoobar
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 25, 2026
@eggfoobar
Contributor

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f7658120-12d3-11f1-8774-b61f9d4b6895-0

@jaypoulz jaypoulz force-pushed the tnf-skip-aggresively branch from 5d188d0 to 23dc7fa on February 26, 2026 at 13:10
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026
@eggfoobar
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026

@coderabbitai coderabbitai bot left a comment



⚠️ Outside diff range comments (1)
test/extended/two_node/utils/common.go (1)

1186-1205: ⚠️ Potential issue | 🟠 Major

ensureEtcdPodsAreRunning does not actually wait up to the provided timeout.

Line 1191 uses context.Background() instead of the timeout-scoped ctx, so the API call itself is not bounded by this function's timeout. Additionally, line 1204 returns immediately when fewer than 2 etcd pods are running, bypassing the timeout check logic entirely. The function should retry while the timeout context is still valid instead of failing immediately.

Suggested fix
 func ensureEtcdPodsAreRunning(oc *exutil.CLI, timeout time.Duration) error {
 	framework.Logf("Ensure Etcd pods are running (timeout: %v)", timeout)
 	ctx, cancel := context.WithTimeout(context.Background(), timeout)
 	defer cancel()
 	for {
-		etcdPods, err := oc.AdminKubeClient().CoreV1().Pods("openshift-etcd").List(context.Background(), metav1.ListOptions{
+		etcdPods, err := oc.AdminKubeClient().CoreV1().Pods("openshift-etcd").List(ctx, metav1.ListOptions{
 			LabelSelector: "app=etcd",
 		})
 		if err != nil {
 			err = fmt.Errorf("failed to retrieve etcd pods: %v", err)
 		} else {
 			runningPods := 0
 			for _, pod := range etcdPods.Items {
 				if pod.Status.Phase == corev1.PodRunning {
 					runningPods++
 				}
 			}
 			if runningPods < 2 {
-				return fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)
+				err = fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)
+			} else {
+				framework.Logf("SUCCESS: found the 2 expected Etcd pods")
+				return nil
 			}
-
-			framework.Logf("SUCCESS: found the 2 expected Etcd pods")
-			return nil
 		}
 
 		select {
 		case <-ctx.Done():
 			return err
 		default:
 		}
 		time.Sleep(FiveSecondPollInterval)
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/utils/common.go` around lines 1186 - 1205, The
ensureEtcdPodsAreRunning function currently uses context.Background() for the
List call and returns immediately when fewer than 2 pods are running; update it
to use the timeout-scoped ctx for the API call
(oc.AdminKubeClient().CoreV1().Pods(...).List(ctx,...)) and change the logic so
that when runningPods < 2 you do not return an error immediately but instead
loop/retry until ctx is done (check <-ctx.Done() to break and return a timeout
error). Ensure you still cancel the context via defer cancel(), and only return
success when the runningPods condition is met or return a clear timeout error
after ctx expires.


📥 Commits

Reviewing files that changed from the base of the PR and between 5d188d0 and 23dc7fa.

📒 Files selected for processing (1)
  • test/extended/two_node/utils/common.go

…cluster health

Precondition checks use 5m for operators and 1m for etcd; recovery/reboot keeps 15m. Ensure helpers accept timeout parameter.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jaypoulz jaypoulz force-pushed the tnf-skip-aggresively branch from 23dc7fa to 6013d7f on February 26, 2026 at 14:09
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026

@coderabbitai coderabbitai bot left a comment



⚠️ Outside diff range comments (1)
test/extended/two_node/utils/common.go (1)

1185-1217: ⚠️ Potential issue | 🟠 Major

ensureEtcdPodsAreRunning returns immediately on insufficient pods instead of retrying within timeout, and ignores the timeout context.

The function creates a timeout-bound context at line 1188 but never uses it. The Pod list call at line 1191 uses context.Background() instead of ctx, so the Kubernetes API call has no timeout constraint. Additionally, line 1203 returns an error immediately when fewer than 2 pods are running, preventing the function from retrying within the timeout window. The loop should only exit on timeout or success; transient pod unavailability should trigger a retry after the sleep interval, not an immediate failure.

Fix by: (1) changing context.Background() to ctx in the List call, and (2) inverting the condition at line 1203 so that success (runningPods >= 2) returns immediately, while insufficient pods continues the loop and retries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/utils/common.go` around lines 1185 - 1217, In
ensureEtcdPodsAreRunning, use the timeout-bound context for the API call and
only return on success: replace the List call's context.Background() with ctx,
and invert the pods check so that if runningPods >= 2 you return nil (success),
otherwise do not return an error immediately but continue the loop (allowing the
select on ctx.Done() to handle timeout and then return the last error). Ensure
the function still logs success via framework.Logf when returning nil and
preserves the existing err wrapping for List failures.


📥 Commits

Reviewing files that changed from the base of the PR and between 23dc7fa and 6013d7f.

📒 Files selected for processing (1)
  • test/extended/two_node/utils/common.go

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@eggfoobar
Contributor

/verified bypass
/lgtm

Verified not required, this is just a simple change to timeouts that stops us from waiting too long.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 26, 2026
@openshift-ci-robot

@eggfoobar: The verified label has been added.

Details

In response to this:

/verified bypass
/lgtm

Verified not required, this is just a simple change to timeouts that stops us from waiting too long.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026
@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eggfoobar, jaypoulz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jaypoulz
Contributor Author

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@jaypoulz: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 2a1c975 into openshift:main Feb 26, 2026
21 checks passed