
NO-JIRA: refactor(tnf): separate precondition timeouts from recovery cluster health #30812

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from jaypoulz:tnf-skip-aggresively
Feb 26, 2026

Conversation

@jaypoulz
Contributor

@jaypoulz jaypoulz commented Feb 25, 2026

Updated TNF recovery tests to have a fail-fast pattern so we don't time out in disaster cases.

Precondition checks use 5m for operators and 1m for etcd; recovery/reboot keeps 15m. Helpers were updated to accept a timeout parameter.

Summary by CodeRabbit

  • Tests
    • Introduced explicit precondition timeouts (5 min for cluster, 1 min for etcd) and refactored health-check orchestration to use them.
    • Updated health-check helpers to accept explicit timeout parameters and aligned logging/timing accordingly.
    • Improved stability and reduced flakiness of cluster and etcd health validations in tests.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 25, 2026
@openshift-ci-robot

@jaypoulz: This pull request explicitly references no jira issue.

Details

In response to this:

Updated TNF recovery tests to have a fail-fast pattern so we don't time out in disaster cases.

Precondition checks use 5m for operators and 1m for etcd; recovery/reboot keeps 15m. Ensure helpers accept timeout parameter.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Feb 25, 2026

Walkthrough

Adds two precondition timeout constants and refactors health-check helpers to accept explicit timeout parameters; updates call sites to use the new timeouts instead of hard-coded durations and preserves an existing debug container timeout.

Changes

Cohort / File(s) Summary
Timeouts & Health-check helpers
test/extended/two_node/utils/common.go
Adds preconditionClusterHealthyTimeout = 5 * time.Minute and preconditionEtcdHealthyTimeout = 1 * time.Minute; retains debugContainerTimeout = 60 * time.Second. Refactors ensureClusterOperatorHealthy(oc *exutil.CLI, timeout time.Duration), ensureEtcdPodsAreRunning(oc *exutil.CLI, timeout time.Duration), and ensureEtcdHasTwoVotingMembers(nodes *corev1.NodeList, ecf *helpers.EtcdClientFactoryImpl, timeout time.Duration) to accept explicit timeouts. Updates callers (e.g., SkipIfClusterIsNotHealthy) to pass the new precondition timeouts and aligns log messages and internal timeout usage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Test Structure And Quality: ❓ Inconclusive. The file referenced in the PR summary does not exist in the repository; searches for related functions and timeout constants yield no results.
Resolution: verify the correct file path exists and is accessible, confirm the PR changes are committed to the current branch, or provide the accurate file location.
✅ Passed checks (4 passed)
Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
Title check: ✅ Passed. The title accurately reflects the main refactoring objective (separating precondition timeouts from recovery cluster health in the TNF tests), which is precisely what the changeset accomplishes.
Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required 80.00% threshold.
Stable And Deterministic Test Names: ✅ Passed. The PR modifies a utility/helper file (common.go) with timeout constants and health-check functions, not test definitions with titles. No test names are changed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 25, 2026
@openshift-ci openshift-ci bot requested review from eggfoobar and qJkee February 25, 2026 20:31

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/extended/two_node/utils/common.go (1)

1185-1205: ⚠️ Potential issue | 🟠 Major

ensureEtcdPodsAreRunning bypasses its timeout on pod-count failures.

Line 1203 returns immediately when fewer than 2 pods are running, so this path does not “wait” despite accepting a timeout. This can cause false precondition skips during brief transitions.

Proposed fix
 		} else {
 			runningPods := 0
 			for _, pod := range etcdPods.Items {
 				if pod.Status.Phase == corev1.PodRunning {
 					runningPods++
 				}
 			}
-			if runningPods < 2 {
-				return fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)
-			}
-
-			framework.Logf("SUCCESS: found the 2 expected Etcd pods")
-			return nil
+			if runningPods < 2 {
+				err = fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)
+			} else {
+				framework.Logf("SUCCESS: found the 2 expected Etcd pods")
+				return nil
+			}
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/utils/common.go` around lines 1185 - 1205, The
function ensureEtcdPodsAreRunning currently returns immediately when runningPods
< 2, bypassing the provided timeout; change the logic so pod-count failures set
a local error (e.g., err = fmt.Errorf("expected at least 2 etcd pods running,
found %d", runningPods)) and allow the for loop to continue until the context
deadline instead of returning; also use the created ctx when calling
oc.AdminKubeClient().CoreV1().Pods(...).List(ctx, ...) and after the loop
returns the most recent error if the context times out (or nil on success),
optionally adding a short sleep/backoff between retries to avoid tight looping.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between a631c60 and 5d188d0.

📒 Files selected for processing (1)
  • test/extended/two_node/utils/common.go

@jaypoulz
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Feb 25, 2026

@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e691a440-1289-11f1-936b-8aa12b8bffa6-0

@eggfoobar
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 25, 2026
@eggfoobar
Contributor

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f7658120-12d3-11f1-8774-b61f9d4b6895-0

@jaypoulz jaypoulz force-pushed the tnf-skip-aggresively branch from 5d188d0 to 23dc7fa on February 26, 2026 at 13:10
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026
@eggfoobar
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026

@coderabbitai coderabbitai bot left a comment



⚠️ Outside diff range comments (1)
test/extended/two_node/utils/common.go (1)

1186-1205: ⚠️ Potential issue | 🟠 Major

ensureEtcdPodsAreRunning does not actually wait up to the provided timeout.

Line 1191 uses context.Background() instead of the timeout-scoped ctx, so the API call itself is not bounded by this function's timeout. Additionally, line 1204 returns immediately when fewer than 2 etcd pods are running, bypassing the timeout check logic entirely. The function should retry while the timeout context is still valid instead of failing immediately.

Suggested fix
 func ensureEtcdPodsAreRunning(oc *exutil.CLI, timeout time.Duration) error {
 	framework.Logf("Ensure Etcd pods are running (timeout: %v)", timeout)
 	ctx, cancel := context.WithTimeout(context.Background(), timeout)
 	defer cancel()
 	for {
-		etcdPods, err := oc.AdminKubeClient().CoreV1().Pods("openshift-etcd").List(context.Background(), metav1.ListOptions{
+		etcdPods, err := oc.AdminKubeClient().CoreV1().Pods("openshift-etcd").List(ctx, metav1.ListOptions{
 			LabelSelector: "app=etcd",
 		})
 		if err != nil {
 			err = fmt.Errorf("failed to retrieve etcd pods: %v", err)
 		} else {
 			runningPods := 0
 			for _, pod := range etcdPods.Items {
 				if pod.Status.Phase == corev1.PodRunning {
 					runningPods++
 				}
 			}
 			if runningPods < 2 {
-				return fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)
+				err = fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)
+			} else {
+				framework.Logf("SUCCESS: found the 2 expected Etcd pods")
+				return nil
 			}
-
-			framework.Logf("SUCCESS: found the 2 expected Etcd pods")
-			return nil
 		}
 
 		select {
 		case <-ctx.Done():
 			return err
 		default:
 		}
 		time.Sleep(FiveSecondPollInterval)
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/utils/common.go` around lines 1186 - 1205, The
ensureEtcdPodsAreRunning function currently uses context.Background() for the
List call and returns immediately when fewer than 2 pods are running; update it
to use the timeout-scoped ctx for the API call
(oc.AdminKubeClient().CoreV1().Pods(...).List(ctx,...)) and change the logic so
that when runningPods < 2 you do not return an error immediately but instead
loop/retry until ctx is done (check <-ctx.Done() to break and return a timeout
error). Ensure you still cancel the context via defer cancel(), and only return
success when the runningPods condition is met or return a clear timeout error
after ctx expires.


📥 Commits

Reviewing files that changed from the base of the PR and between 5d188d0 and 23dc7fa.

📒 Files selected for processing (1)
  • test/extended/two_node/utils/common.go

…cluster health

Precondition checks use 5m for operators and 1m for etcd; recovery/reboot keeps 15m. Ensure helpers accept timeout parameter.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jaypoulz jaypoulz force-pushed the tnf-skip-aggresively branch from 23dc7fa to 6013d7f on February 26, 2026 at 14:09
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026

@coderabbitai coderabbitai bot left a comment



⚠️ Outside diff range comments (1)
test/extended/two_node/utils/common.go (1)

1185-1217: ⚠️ Potential issue | 🟠 Major

ensureEtcdPodsAreRunning returns immediately on insufficient pods instead of retrying within timeout, and ignores the timeout context.

The function creates a timeout-bound context at line 1188 but never uses it. The Pod list call at line 1191 uses context.Background() instead of ctx, so the Kubernetes API call has no timeout constraint. Additionally, line 1203 returns an error immediately when fewer than 2 pods are running, preventing the function from retrying within the timeout window. The loop should only exit on timeout or success; transient pod unavailability should trigger a retry after the sleep interval, not an immediate failure.

Fix by: (1) changing context.Background() to ctx in the List call, and (2) inverting the condition at line 1203 so that success (runningPods >= 2) returns immediately, while insufficient pods continues the loop and retries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/utils/common.go` around lines 1185 - 1217, In
ensureEtcdPodsAreRunning, use the timeout-bound context for the API call and
only return on success: replace the List call's context.Background() with ctx,
and invert the pods check so that if runningPods >= 2 you return nil (success),
otherwise do not return an error immediately but continue the loop (allowing the
select on ctx.Done() to handle timeout and then return the last error). Ensure
the function still logs success via framework.Logf when returning nil and
preserves the existing err wrapping for List failures.


📥 Commits

Reviewing files that changed from the base of the PR and between 23dc7fa and 6013d7f.

📒 Files selected for processing (1)
  • test/extended/two_node/utils/common.go

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@eggfoobar
Contributor

/verified bypass
/lgtm

Verified not required, this is just a simple change to timeouts that stops us from waiting too long.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Feb 26, 2026
@openshift-ci-robot

@eggfoobar: The verified label has been added.

Details

In response to this:

/verified bypass
/lgtm

Verified not required, this is just a simple change to timeouts that stops us from waiting too long.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2026
@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eggfoobar, jaypoulz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jaypoulz
Contributor Author

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@jaypoulz: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 2a1c975 into openshift:main Feb 26, 2026
21 checks passed