NO-JIRA: refactor(tnf): separate precondition timeouts from recovery cluster health #30812
@jaypoulz: This pull request explicitly references no jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough: Adds two precondition timeout constants and refactors the health-check helpers to accept explicit timeout parameters; updates call sites to use the new timeouts instead of hard-coded durations and preserves the existing debug container timeout.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ 4 passed | ❌ 1 inconclusive
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it is a critical failure.
🔧 golangci-lint (2.5.0): Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
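That lint error typically means the repository's `.golangci.yml` predates golangci-lint v2, which requires an explicit version field. A minimal sketch of the fix, assuming the default config filename (the rest of the existing config would be migrated per the linked guide):

```yaml
# Sketch only: golangci-lint v2 rejects configs without a version field.
# Adding it is the first step of the v1-to-v2 migration.
version: "2"
```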
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/extended/two_node/utils/common.go (1)
1185-1205: ⚠️ Potential issue | 🟠 Major
`ensureEtcdPodsAreRunning` bypasses its timeout on pod-count failures. Line 1203 returns immediately when fewer than 2 pods are running, so this path does not wait despite accepting a timeout. This can cause false precondition skips during brief transitions.
Proposed fix
```diff
 } else {
 	runningPods := 0
 	for _, pod := range etcdPods.Items {
 		if pod.Status.Phase == corev1.PodRunning {
 			runningPods++
 		}
 	}
-	if runningPods < 2 {
-		return fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)
-	}
-
-	framework.Logf("SUCCESS: found the 2 expected Etcd pods")
-	return nil
+	if runningPods < 2 {
+		err = fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)
+	} else {
+		framework.Logf("SUCCESS: found the 2 expected Etcd pods")
+		return nil
+	}
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/utils/common.go` around lines 1185 - 1205, The function ensureEtcdPodsAreRunning currently returns immediately when runningPods < 2, bypassing the provided timeout; change the logic so pod-count failures set a local error (e.g., err = fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)) and allow the for loop to continue until the context deadline instead of returning; also use the created ctx when calling oc.AdminKubeClient().CoreV1().Pods(...).List(ctx, ...) and after the loop returns the most recent error if the context times out (or nil on success), optionally adding a short sleep/backoff between retries to avoid tight looping.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
test/extended/two_node/utils/common.go
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview
@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e691a440-1289-11f1-936b-8aa12b8bffa6-0
/lgtm
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview
@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f7658120-12d3-11f1-8774-b61f9d4b6895-0
5d188d0 to 23dc7fa (compare)

/lgtm
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/extended/two_node/utils/common.go (1)
1186-1205: ⚠️ Potential issue | 🟠 Major
`ensureEtcdPodsAreRunning` does not actually wait up to the provided timeout. Line 1191 uses `context.Background()` instead of the timeout-scoped `ctx`, so the API call itself is not bounded by this function's timeout. Additionally, line 1204 returns immediately when fewer than 2 etcd pods are running, bypassing the timeout check logic entirely. The function should retry while the timeout context is still valid instead of failing immediately.
Suggested fix
```diff
 func ensureEtcdPodsAreRunning(oc *exutil.CLI, timeout time.Duration) error {
 	framework.Logf("Ensure Etcd pods are running (timeout: %v)", timeout)
 	ctx, cancel := context.WithTimeout(context.Background(), timeout)
 	defer cancel()
 	for {
-		etcdPods, err := oc.AdminKubeClient().CoreV1().Pods("openshift-etcd").List(context.Background(), metav1.ListOptions{
+		etcdPods, err := oc.AdminKubeClient().CoreV1().Pods("openshift-etcd").List(ctx, metav1.ListOptions{
 			LabelSelector: "app=etcd",
 		})
 		if err != nil {
 			err = fmt.Errorf("failed to retrieve etcd pods: %v", err)
 		} else {
 			runningPods := 0
 			for _, pod := range etcdPods.Items {
 				if pod.Status.Phase == corev1.PodRunning {
 					runningPods++
 				}
 			}
 			if runningPods < 2 {
-				return fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)
+				err = fmt.Errorf("expected at least 2 etcd pods running, found %d", runningPods)
+			} else {
+				framework.Logf("SUCCESS: found the 2 expected Etcd pods")
+				return nil
 			}
-
-			framework.Logf("SUCCESS: found the 2 expected Etcd pods")
-			return nil
 		}
 		select {
 		case <-ctx.Done():
 			return err
 		default:
 		}
 		time.Sleep(FiveSecondPollInterval)
 	}
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/utils/common.go` around lines 1186 - 1205, The ensureEtcdPodsAreRunning function currently uses context.Background() for the List call and returns immediately when fewer than 2 pods are running; update it to use the timeout-scoped ctx for the API call (oc.AdminKubeClient().CoreV1().Pods(...).List(ctx,...)) and change the logic so that when runningPods < 2 you do not return an error immediately but instead loop/retry until ctx is done (check <-ctx.Done() to break and return a timeout error). Ensure you still cancel the context via defer cancel(), and only return success when the runningPods condition is met or return a clear timeout error after ctx expires.
…cluster health
Precondition checks use 5m for operators and 1m for etcd; recovery/reboot keeps 15m. Ensure helpers accept timeout parameter.
Co-authored-by: Cursor <cursoragent@cursor.com>
23dc7fa to 6013d7f (compare)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/extended/two_node/utils/common.go (1)
1185-1217: ⚠️ Potential issue | 🟠 Major
`ensureEtcdPodsAreRunning` returns immediately on insufficient pods instead of retrying within the timeout, and ignores the timeout context. The function creates a timeout-bound context at line 1188 but never uses it. The Pod list call at line 1191 uses `context.Background()` instead of `ctx`, so the Kubernetes API call has no timeout constraint. Additionally, line 1203 returns an error immediately when fewer than 2 pods are running, preventing the function from retrying within the timeout window. The loop should only exit on timeout or success; transient pod unavailability should trigger a retry after the sleep interval, not an immediate failure.
Fix by: (1) changing `context.Background()` to `ctx` in the List call, and (2) inverting the condition at line 1203 so that success (runningPods >= 2) returns immediately, while insufficient pods continues the loop and retries.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/utils/common.go` around lines 1185 - 1217, In ensureEtcdPodsAreRunning, use the timeout-bound context for the API call and only return on success: replace the List call's context.Background() with ctx, and invert the pods check so that if runningPods >= 2 you return nil (success), otherwise do not return an error immediately but continue the loop (allowing the select on ctx.Done() to handle timeout and then return the last error). Ensure the function still logs success via framework.Logf when returning nil and preserves the existing err wrapping for List failures.
Scheduling required tests:
/verified bypass
Verified not required, this is just a simple change to timeouts that stops us from waiting too long.

@eggfoobar: In response to this:
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: eggfoobar, jaypoulz. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest-required
@jaypoulz: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Updated TNF recovery tests to have a fail-fast pattern so we don't time out in disaster cases.
Precondition checks use 5m for operators and 1m for etcd; recovery/reboot keeps 15m. Ensure helpers accept timeout parameter.