OCPBUGS-38676: Add retry for transient API errors to prevent Degraded blips by sg00dwin · Pull Request #1164 · openshift/console-operator

sg00dwin · 2026-05-27T18:30:10Z

Summary

The console ClusterOperator intermittently flips Degraded=True during upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hit resourceapply.Apply*() calls. The error flows directly to the status handler with no retry, setting Degraded=True for ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).

This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.

Changes

New: pkg/console/controllers/util/retry.go — shared RetryOnTransientError() with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).
New: pkg/console/controllers/util/retry_test.go — 10 unit tests covering error classification and retry behavior.
New: pkg/console/operator/retry.go — thin wrapper for the operator package.
Modified: Wrapped 16 API write call sites across 10 files:
- sync_v400.go — ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecret
- Individual controllers — PDB, Service, CLIDownloads, DownloadsDeployment, ServiceAccounts, OAuthClients, OAuthClientSecret, UpgradeNotification, StorageVersionMigration
Not modified: Healthcheck controller (already has its own retry from OCPBUGS-24041)

Testing

All existing unit tests pass
New retry utility tests cover: retryable vs permanent error classification, retry-then-succeed, retry exhaustion, generic network error retry
make test-unit passes clean
gofmt / go vet clean

Summary by CodeRabbit

Bug Fixes
- Enhanced reliability by adding automatic retry logic for transient failures across multiple resource operations (create, update, delete, apply) in Kubernetes API interactions, reducing false failures.
Tests
- Added comprehensive test coverage validating transient error retry behavior and backoff policies.

blips Wrap all API write operations (Apply, Create, Update, Delete) with retry logic to absorb transient errors (conflicts, timeouts, connection refused) during upgrades before they reach status reporting. Without retry, a single transient failure sets Degraded=True for ~1 minute until the next sync resolves it, causing CI test failures. Retry parameters: 3 attempts, 500ms backoff with 2x factor (~3.5s max). Permanent errors (Forbidden, Invalid, AlreadyExists) are not retried. Assisted by Claude Code (Opus 4.6)

openshift-ci-robot · 2026-05-27T18:30:16Z

@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

The console ClusterOperator intermittently flips Degraded=True during upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hit resourceapply.Apply*() calls. The error flows directly to the status handler with no retry, setting Degraded=True for ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).

This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.

Changes

New: pkg/console/controllers/util/retry.go — shared RetryOnTransientError() with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).

New: pkg/console/controllers/util/retry_test.go — 10 unit tests covering error classification and retry behavior.

New: pkg/console/operator/retry.go — thin wrapper for the operator package.

Modified: Wrapped 16 API write call sites across 10 files:

sync_v400.go — ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecret

Individual controllers — PDB, Service, CLIDownloads, DownloadsDeployment, ServiceAccounts, OAuthClients, OAuthClientSecret, UpgradeNotification, StorageVersionMigration

Not modified: Healthcheck controller (already has its own retry from OCPBUGS-24041)

Testing

All existing unit tests pass

New retry utility tests cover: retryable vs permanent error classification, retry-then-succeed, retry exhaustion, generic network error retry

make test-unit passes clean

gofmt / go vet clean

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-05-27T18:30:24Z

Walkthrough

This PR adds transient error retry logic for Kubernetes API operations across console controllers and operator sync methods. A new retry utility module with backoff configuration and error classification is introduced, then consistently applied to wrap create, update, apply, and delete operations in nine controllers and eight operator sync functions.

Changes

Transient Error Retry Integration

Layer / File(s)	Summary
Retry infrastructure `pkg/console/controllers/util/retry.go`, `pkg/console/controllers/util/retry_test.go`	Defines `TransientBackoff` (3 attempts, 500ms base, 2.0 factor), `IsRetryableError` predicate (retries most errors except Forbidden, Invalid, MethodNotSupported, NotAcceptable, AlreadyExists), and `RetryOnTransientError(fn)` wrapper that retries with backoff and V(4) logging. Tests validate classification and control flow across scenarios.
Console controller retry wrapping `pkg/console/controllers/clidownloads/controller.go`, `pkg/console/controllers/downloadsdeployment/controller.go`, `pkg/console/controllers/oauthclients/oauthclients.go`, `pkg/console/controllers/oauthclientsecret/oauthclientsecret.go`, `pkg/console/controllers/poddisruptionbudget/controller.go`, `pkg/console/controllers/service/controller.go`, `pkg/console/controllers/serviceaccounts/controller.go`, `pkg/console/controllers/storageversionmigration/controller.go`, `pkg/console/controllers/upgradenotification/controller.go`	Nine controllers wrap their Kubernetes API operations (create, update, apply, delete) in `util.RetryOnTransientError`. Each retries on transient failures and returns consistent failure reasons on final failure.
Operator retry helper and sync wrapping `pkg/console/operator/retry.go`, `pkg/console/operator/sync_v400.go`	Operator-local `retryOnTransientError` wrapper delegates to `util.RetryOnTransientError`. Eight sync functions (SyncConsoleConfig, SyncConsolePublicConfig, SyncDeployment, SyncConfigMap, SyncServiceCAConfigMap, SyncTrustedCAConfigMap, syncSessionSecret) wrap their Kubernetes API calls with retry logic.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality	❓ Inconclusive	The custom check specifies "Ginkgo test code" requirements (It blocks, BeforeEach/AfterEach, Eventually/Consistently timeouts), but the PR's test file uses standard Go's testing package, not Ginkgo.	Clarify if check applies to standard Go tests. If yes, retry_test.go properly uses table-driven tests with descriptive messages and covers scenarios via t.Run sub-tests; no changes needed.

✅ Passed checks (13 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly references the Jira issue (OCPBUGS-38676) and accurately summarizes the main change: adding retry logic for transient API errors to prevent Degraded status blips.
Description check	✅ Passed	The PR description is comprehensive and covers the root cause, solution, testing, and implementation details. While not formatted as the template suggests, it provides all essential information needed for review.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names in the PR are stable and deterministic. No dynamic content, pod names, timestamps, UUIDs, namespaces with suffixes, or fmt.Sprintf found in test titles.
Microshift Test Compatibility	✅ Passed	This PR does not add any Ginkgo e2e tests. The only new test file (retry_test.go) is a standard Go unit test with testing.T, not a Ginkgo test.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	PR adds no Ginkgo e2e tests; only standard Go unit tests (testing.T) for retry utility in pkg/console/controllers/util/retry_test.go, which are not SNO-specific.
Topology-Aware Scheduling Compatibility	✅ Passed	PR only adds retry logic for transient API errors. No scheduling constraints, affinity rules, topology constraints, or topology-related changes are introduced.
Ote Binary Stdout Contract	✅ Passed	RetryOnTransientError only called from controller methods. Uses klog.V(4).Infof() which writes to STDERR, not STDOUT. No fmt.Print or os.Stdout writes found.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No Ginkgo e2e tests are added in this PR. All test files (including the new retry_test.go) use standard Go testing package. Check not applicable.
No-Weak-Crypto	✅ Passed	No weak cryptography detected: no MD5/SHA1/DES/RC4 usage, no custom crypto implementations, and string comparison used for config reconciliation (not authentication). Uses secure crypto/rand.
Container-Privileges	✅ Passed	PR modifies only Go controller code, no container/K8s manifests changed. Existing deployments have secure configs with allowPrivilegeEscalation: false and no privileged settings.
No-Sensitive-Data-In-Logs	✅ Passed	PR adds one logging statement for Kubernetes API errors at V(4) level in retry.go. API errors contain only status/messages, not request body or credentials.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-05-27T18:31:21Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sg00dwin
Once this PR has been reviewed and has the lgtm label, please assign jhadvig for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sg00dwin · 2026-05-27T18:32:08Z

/jira refresh

openshift-ci-robot · 2026-05-27T18:32:15Z

@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2026-05-27T18:34:10Z

@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

Summary

The console ClusterOperator intermittently flips Degraded=True during upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hit resourceapply.Apply*() calls. The error flows directly to the status handler with no retry, setting Degraded=True for ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).

This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.

Changes

New: pkg/console/controllers/util/retry.go — shared RetryOnTransientError() with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).

New: pkg/console/controllers/util/retry_test.go — 10 unit tests covering error classification and retry behavior.

New: pkg/console/operator/retry.go — thin wrapper for the operator package.

Modified: Wrapped 16 API write call sites across 10 files:

sync_v400.go — ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecret

Individual controllers — PDB, Service, CLIDownloads, DownloadsDeployment, ServiceAccounts, OAuthClients, OAuthClientSecret, UpgradeNotification, StorageVersionMigration

Not modified: Healthcheck controller (already has its own retry from OCPBUGS-24041)

Testing

All existing unit tests pass

New retry utility tests cover: retryable vs permanent error classification, retry-then-succeed, retry exhaustion, generic network error retry

make test-unit passes clean

gofmt / go vet clean

Summary by CodeRabbit

Bug Fixes

Enhanced reliability by adding automatic retry logic for transient failures across multiple resource operations (create, update, delete, apply) in Kubernetes API interactions, reducing false failures.

Tests

Added comprehensive test coverage validating transient error retry behavior and backoff policies.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/console/controllers/service/controller.go (1)

144-149: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate redirect sync errors instead of shadowing svcErr.

At Line 144, redirectSvcErrReason, svcErr := ... shadows the outer svcErr. Line 148 then returns the old outer value, so redirect failures are not returned from Sync.

Suggested fix

-	if c.serviceName == api.OpenShiftConsoleServiceName {
-		redirectSvcErrReason, svcErr := c.SyncRedirectService(ctx, routeConfig, controllerContext)
-		statusHandler.AddConditions(status.HandleProgressingOrDegraded("RedirectServiceSync", redirectSvcErrReason, svcErr))
-	}
-
-	return statusHandler.FlushAndReturn(svcErr)
+	if c.serviceName == api.OpenShiftConsoleServiceName {
+		redirectSvcErrReason, redirectSvcErr := c.SyncRedirectService(ctx, routeConfig, controllerContext)
+		statusHandler.AddConditions(status.HandleProgressingOrDegraded("RedirectServiceSync", redirectSvcErrReason, redirectSvcErr))
+		if redirectSvcErr != nil {
+			return statusHandler.FlushAndReturn(redirectSvcErr)
+		}
+	}
+
+	return statusHandler.FlushAndReturn(svcErr)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/console/controllers/service/controller.go` around lines 144 - 149, The
redirect sync call is shadowing the outer svcErr (redirectSvcErrReason, svcErr
:= c.SyncRedirectService...), causing the function to return the old svcErr;
change the declaration to assign to the existing outer variable (use = instead
of :=) or use a distinct temporary name and then set the outer svcErr = the
returned error; ensure the call to SyncRedirectService (and the variables
redirectSvcErrReason and svcErr) updates the same svcErr that FlushAndReturn
uses and keep statusHandler.AddConditions using the updated svcErr.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/console/operator/sync_v400.go`:
- Around line 517-527: The Create retry may succeed server-side then later
return apierrors.IsAlreadyExists, causing SyncServiceCAConfigMap and
SyncTrustedCAConfigMap to surface a create error and flip Degraded; update the
createErr handling after retryOnTransientError (the block that sets actual and
createErr from retryOnTransientError calling
configMapClient.ConfigMaps(...).Create) to detect apierrors.IsAlreadyExists and
in that case call configMapClient.ConfigMaps(...).Get(ctx, required.Name,
metav1.GetOptions{}) to load the existing ConfigMap into actual and clear the
error (return actual, "", nil); apply the same AlreadyExists->Get pattern in
both SyncServiceCAConfigMap and SyncTrustedCAConfigMap so transient post-success
create races do not produce FailedCreate.
- Around line 264-270: The retry logic reuses stale object instances causing
Conflict retries to fail; inside retryOnTransientError closures for
SyncConsoleConfig (UpdateStatus) and for SyncServiceCAConfigMap /
SyncTrustedCAConfigMap (ConfigMaps Update) refetch the latest resource from the
API (not reuse the original updated/existing struct) before calling UpdateStatus
or Update so each attempt uses current resourceVersion; additionally, in the
Create wrappers used in those Sync* functions treat AlreadyExists as success by
fetching the existing object and proceeding (instead of returning FailedCreate),
and keep IsRetryableError behavior but ensure Update/UpdateStatus retries
operate on freshly fetched objects rather than the initial stale instance.

---

Outside diff comments:
In `@pkg/console/controllers/service/controller.go`:
- Around line 144-149: The redirect sync call is shadowing the outer svcErr
(redirectSvcErrReason, svcErr := c.SyncRedirectService...), causing the function
to return the old svcErr; change the declaration to assign to the existing outer
variable (use = instead of :=) or use a distinct temporary name and then set the
outer svcErr = the returned error; ensure the call to SyncRedirectService (and
the variables redirectSvcErrReason and svcErr) updates the same svcErr that
FlushAndReturn uses and keep statusHandler.AddConditions using the updated
svcErr.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 12887e8d-b1d9-4882-b7b9-b035820dd2e0

📥 Commits

Reviewing files that changed from the base of the PR and between c54e288 and 1a8cda1.

📒 Files selected for processing (13)

pkg/console/controllers/clidownloads/controller.go
pkg/console/controllers/downloadsdeployment/controller.go
pkg/console/controllers/oauthclients/oauthclients.go
pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
pkg/console/controllers/poddisruptionbudget/controller.go
pkg/console/controllers/service/controller.go
pkg/console/controllers/serviceaccounts/controller.go
pkg/console/controllers/storageversionmigration/controller.go
pkg/console/controllers/upgradenotification/controller.go
pkg/console/controllers/util/retry.go
pkg/console/controllers/util/retry_test.go
pkg/console/operator/retry.go
pkg/console/operator/sync_v400.go

📜 Review details

🧰 Additional context used

📓 Path-based instructions (4)

**/*.{yaml,yml,go}

📄 CodeRabbit inference engine (Custom checks)

Topology-Aware Scheduling Compatibility: Deployment manifests, operator code, and controllers should not introduce scheduling constraints that assume a standard HA topology with 3+ control-plane nodes and dedicated workers. Flag required pod anti-affinity without topology checks, topology spread constraints with DoNotSchedule that exceed schedulable node counts, replica counts derived from node count without considering SNO/TNF/TNA, nodeSelector targeting control-plane nodes, broad tolerations that could schedule to arbiter nodes on TNA, code targeting only worker nodes, or PodDisruptionBudgets designed for 3+ nodes.

Files:

pkg/console/controllers/downloadsdeployment/controller.go
pkg/console/controllers/upgradenotification/controller.go
pkg/console/controllers/serviceaccounts/controller.go
pkg/console/operator/retry.go
pkg/console/controllers/storageversionmigration/controller.go
pkg/console/controllers/oauthclients/oauthclients.go
pkg/console/controllers/service/controller.go
pkg/console/controllers/util/retry.go
pkg/console/operator/sync_v400.go
pkg/console/controllers/clidownloads/controller.go
pkg/console/controllers/poddisruptionbudget/controller.go
pkg/console/controllers/util/retry_test.go
pkg/console/controllers/oauthclientsecret/oauthclientsecret.go

**/*.go

⚙️ CodeRabbit configuration file

**/*.go: Review Go code following OpenShift operator patterns.
See CONVENTIONS.md for coding standards and patterns.

Refer to the following skills based on CODE PATTERNS, not just file paths:

Refer to /controller-review when code contains:

Controller struct types (e.g., type *Controller struct)

func New*Controller( factory functions

factory.New().WithFilteredEventsInformers( pattern

.ToController( method calls

Sync(ctx context.Context, controllerContext factory.SyncContext) methods

operatorConfig.Spec.ManagementState checks

status.NewStatusHandler or status.Handle* functions

Refer to /sync-handler-review when code contains:

Main operator sync functions (e.g., sync_v400.go content)

Sequential resource syncing with early returns

Incremental reconciliation loops

Multiple resourceapply.Apply*() calls in sequence

Dependency ordering of ConfigMaps → Secrets → Service Accounts → RBAC → Services → Deployments → Routes

Feature gate conditional logic

Refer to /go-quality-review for all Go code to check:

Deprecated imports: ioutil.ReadFile, ioutil.WriteFile, ioutil.ReadAll

Deprecated patterns: Dial without DialContext

Error handling: missing %w in fmt.Errorf

Code smells: deep nesting (4+ levels), functions >100 lines

Magic values: unexplained numbers/strings

Context propagation: context.Background() instead of passed ctx

Missing godoc on exported functions

Files:

pkg/console/controllers/downloadsdeployment/controller.go
pkg/console/controllers/upgradenotification/controller.go
pkg/console/controllers/serviceaccounts/controller.go
pkg/console/operator/retry.go
pkg/console/controllers/storageversionmigration/controller.go
pkg/console/controllers/oauthclients/oauthclients.go
pkg/console/controllers/service/controller.go
pkg/console/controllers/util/retry.go
pkg/console/operator/sync_v400.go
pkg/console/controllers/clidownloads/controller.go
pkg/console/controllers/poddisruptionbudget/controller.go
pkg/console/controllers/util/retry_test.go
pkg/console/controllers/oauthclientsecret/oauthclientsecret.go

**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}

⚙️ CodeRabbit configuration file

**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}: Injection prevention (prodsec-skills):

SQL: parameterized queries only; no string concatenation

Command: no shell=True, os.system, or backtick exec with user input

LDAP/XPath: escape special characters in filters

Path traversal: canonicalize paths, reject ../

Deserialization: no pickle/yaml.load()/eval on untrusted data

Prototype pollution: no recursive merge of untrusted objects

Validate at trust boundaries with allow-lists, not deny-lists

Normalize Unicode and anchor regexes (^$); watch for ReDoS

Files:

pkg/console/controllers/downloadsdeployment/controller.go
pkg/console/controllers/upgradenotification/controller.go
pkg/console/controllers/serviceaccounts/controller.go
pkg/console/operator/retry.go
pkg/console/controllers/storageversionmigration/controller.go
pkg/console/controllers/oauthclients/oauthclients.go
pkg/console/controllers/service/controller.go
pkg/console/controllers/util/retry.go
pkg/console/operator/sync_v400.go
pkg/console/controllers/clidownloads/controller.go
pkg/console/controllers/poddisruptionbudget/controller.go
pkg/console/controllers/util/retry_test.go
pkg/console/controllers/oauthclientsecret/oauthclientsecret.go

**/*_test.go

📄 CodeRabbit inference engine (Custom checks)

**/*_test.go: IPv6 and Disconnected Network Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that contain IPv4 assumptions such as hardcoded IPv4 addresses, IPv4-only parsing logic, IPv4-specific service/endpoint creation, or hardcoded IPv4 network policies. Also flag tests requiring external connectivity to public internet services, pulling images from public registries, or DNS resolution of public hostnames.
MicroShift Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that reference OpenShift-specific APIs unavailable on MicroShift (except Route and SecurityContextConstraints), such as Project/ProjectRequest, Build/BuildConfig, DeploymentConfig, ClusterOperator, ClusterVersion, Etcd operator, CSV/OLM, MachineSet/Machine, ClusterAutoscaler, Console, Monitoring components, ImageRegistry operator, Samples operator, OperatorHub, CloudCredential, or Storage/Network operators.
OTE Binary Stdout Contract: OpenShift Tests Extension (OTE) binaries must communicate with openshift-tests via JSON on stdout only. Flag any stdout writes in process-level code (main(), init(), TestMain(), BeforeSuite(), AfterSuite(), SynchronizedBeforeSuite(), RunSpecs() setup, or top-level var/const initializers). Common violations include klog writing to stdout (must redirect to stderr via klog.SetOutput(os.Stderr) or klog.LogToStderr(true)), and fmt.Print/Println/Printf to stdout in main or suite setup.
Single Node OpenShift (SNO) Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that make multi-node or HA assumptions such as expecting multiple control-plane nodes, multiple worker nodes, pod anti-affinity across nodes, node-to-node communication patterns, leader election failover across replicas, pod rescheduling to different nodes, node scaling operations, or separate node roles.
Stable and Deterministic Te...

Files:

pkg/console/controllers/util/retry_test.go

⚙️ CodeRabbit configuration file

**/*_test.go: Review test code for quality and patterns.

Refer to /unit-test-review when test is in pkg//*_test.go:**

Table-driven test structure with test cases

Use of go-test/deep for struct comparisons

Test naming conventions (TestFunctionName)

Error handling with wantErr pattern

Edge case coverage (nil, empty, boundary values)

Proper assertions with helpful error messages

Test isolation (no shared mutable state)

Refer to /e2e-test-review when test contains:

framework.MustNewClientset(t, nil) or similar e2e framework usage

wait.Poll or wait.PollImmediate patterns

retry.RetryOnConflict for updates

Cleanup via defer functions

Console/operator CR manipulations

Test assertions on cluster state

Suggest to use /e2e-test-review when:

PR adds new feature requiring e2e coverage

Test file is empty or skeleton

Comments indicate "TODO: add test"

Review for common issues:

Missing cleanup (defer statements)

Using time.Sleep instead of wait.Poll

Missing context timeouts

Vague error messages in assertions

Tests without table-driven structure when testing multiple cases

Ignoring errors with _

Tests without assertions

Files:

pkg/console/controllers/util/retry_test.go

🔇 Additional comments (12)

pkg/console/controllers/util/retry.go (1)

13-51: LGTM!

pkg/console/controllers/util/retry_test.go (1)

11-165: LGTM!

pkg/console/controllers/clidownloads/controller.go (1)

231-240: LGTM!

Also applies to: 257-266

pkg/console/controllers/downloadsdeployment/controller.go (1)

121-134: LGTM!

pkg/console/controllers/oauthclients/oauthclients.go (1)

229-233: LGTM!

pkg/console/controllers/oauthclientsecret/oauthclientsecret.go (1)

164-167: LGTM!

pkg/console/controllers/poddisruptionbudget/controller.go (1)

100-103: LGTM!

pkg/console/controllers/serviceaccounts/controller.go (1)

155-166: LGTM!

pkg/console/controllers/storageversionmigration/controller.go (1)

17-17: LGTM!

Also applies to: 107-109

pkg/console/controllers/upgradenotification/controller.go (1)

134-147: LGTM!

pkg/console/operator/retry.go (1)

8-10: LGTM!

pkg/console/operator/sync_v400.go (1)

277-285: LGTM!

Also applies to: 323-334, 434-441, 844-849

coderabbitai · 2026-05-27T18:42:41Z

+		var actual *corev1.ConfigMap
+		createErr := retryOnTransientError(func() error {
+			var e error
+			actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
+			return e
+		})
+		if createErr == nil {
 			klog.V(4).Infoln("service-ca configmap created")
-			return actual, "", err
-		} else {
-			return actual, "FailedCreate", err
+			return actual, "", nil
 		}
+		return actual, "FailedCreate", createErr


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Verify create retry call sites. rg -n "retryOnTransientError\$func\\(\$ error \\{[\\s\\S]*?Create\\(" pkg/console/operator/sync_v400.go -U # Verify how AlreadyExists is classified in retry utility. rg -n "AlreadyExists|IsAlreadyExists|IsRetryableError" pkg/console/controllers/util/retry.go -C3

Repository: openshift/console-operator

Length of output: 11743

🏁 Script executed:

#!/bin/bash set -euo pipefail # Show exact code around the flagged create retry blocks. sed -n '500,590p' pkg/console/operator/sync_v400.go # Locate and show retry helper definitions (name + implementation). rg -n "retryOnTransientError|RetryOnTransientError" pkg/console/operator/sync_v400.go pkg/console/controllers/util/retry.go sed -n '1,120p' pkg/console/controllers/util/retry.go

Repository: openshift/console-operator

Length of output: 5929

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "retryOnTransientError" pkg/console/operator/sync_v400.go pkg/console -S

Repository: openshift/console-operator

Length of output: 1718

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '1,160p' pkg/console/operator/retry.go

Repository: openshift/console-operator

Length of output: 276

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "\"FailedCreate\"|FailedCreate" pkg/console/operator -S

Repository: openshift/console-operator

Length of output: 233

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "SyncServiceCAConfigMap\\(|SyncTrustedCAConfigMap\\(" pkg/console/operator -S

Repository: openshift/console-operator

Length of output: 756

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '120,220p' pkg/console/operator/sync_v400.go

Repository: openshift/console-operator

Length of output: 4300

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "func HandleProgressingOrDegraded\\(" -S pkg/console-operator pkg || true rg -n "HandleProgressingOrDegraded\\(" -S pkg/console/operator pkg/console/controllers pkg | head

Repository: openshift/console-operator

Length of output: 2161

🏁 Script executed:

sed -n '80,180p' pkg/console/status/status.go

Repository: openshift/console-operator

Length of output: 4251

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "func IsSyncError\\(|IsSyncError\\(" pkg/console -S

Repository: openshift/console-operator

Length of output: 205

🏁 Script executed:

sed -n '1,120p' pkg/console/errors/sync_errors.go

Repository: openshift/console-operator

Length of output: 920

Avoid Degraded blips when ConfigMap Create succeeded but retry hits AlreadyExists

SyncServiceCAConfigMap and SyncTrustedCAConfigMap retry ConfigMaps(...).Create() via retryOnTransientError; apierrors.IsAlreadyExists is treated as non-retryable, so a second attempt that returns AlreadyExists leads to returning "FailedCreate".

With a transient failure after server-side success, this returns a non-nil error into status.HandleProgressingOrDegraded(...), setting Degraded=True for ServiceCASync/TrustedCASync even though the ConfigMap exists.

Suggested fix pattern

createErr := retryOnTransientError(func() error { var e error actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{}) return e }) + if apierrors.IsAlreadyExists(createErr) { + actual, createErr = co.configMapClient.ConfigMaps(required.Namespace).Get(ctx, required.Name, metav1.GetOptions{}) + if createErr == nil { + return actual, "", nil + } + } if createErr == nil { return actual, "", nil }

Apply the same AlreadyExists -> Get handling to both the service-ca and trusted-ca Create retries in pkg/console/operator/sync_v400.go.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

var actual *corev1.ConfigMap

createErr := retryOnTransientError(func() error {

var e error

actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})

return e

})

if createErr == nil {

klog.V(4).Infoln("service-ca configmap created")

return actual, "", err

} else {

return actual, "FailedCreate", err

return actual, "", nil

}

return actual, "FailedCreate", createErr

var actual *corev1.ConfigMap

createErr := retryOnTransientError(func() error {

var e error

actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})

return e

})

if apierrors.IsAlreadyExists(createErr) {

actual, createErr = co.configMapClient.ConfigMaps(required.Namespace).Get(ctx, required.Name, metav1.GetOptions{})

if createErr == nil {

return actual, "", nil

}

}

if createErr == nil {

klog.V(4).Infoln("service-ca configmap created")

return actual, "", nil

}

return actual, "FailedCreate", createErr

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/console/operator/sync_v400.go` around lines 517 - 527, The Create retry may succeed server-side then later return apierrors.IsAlreadyExists, causing SyncServiceCAConfigMap and SyncTrustedCAConfigMap to surface a create error and flip Degraded; update the createErr handling after retryOnTransientError (the block that sets actual and createErr from retryOnTransientError calling configMapClient.ConfigMaps(...).Create) to detect apierrors.IsAlreadyExists and in that case call configMapClient.ConfigMaps(...).Get(ctx, required.Name, metav1.GetOptions{}) to load the existing ConfigMap into actual and clear the error (return actual, "", nil); apply the same AlreadyExists->Get pattern in both SyncServiceCAConfigMap and SyncTrustedCAConfigMap so transient post-success create races do not produce FailedCreate.

[Claude] Same reasoning as the reply above — the Create paths here are only reached when the lister returned NotFound. AlreadyExists is already classified as non-retryable in IsRetryableError, so the retry stops immediately on that error with no wasted attempts.

The phantom-success scenario (server-side Create succeeds but client gets a timeout, retry hits AlreadyExists) is a narrow race that self-heals on the next 60-second reconcile. Adding a Get-on-AlreadyExists fallback would add complexity for an edge case that resolves itself within one sync cycle. The original code before this PR had the same behavior — AlreadyExists on Create was always returned as a failure.

@sg00dwin, fair point — since AlreadyExists is already classified as non-retryable, the retry wrapper exits immediately on that error without looping, and the phantom-success race is both narrow and self-correcting within one reconcile cycle. The added complexity of a Get-on-AlreadyExists fallback isn't justified for such an edge case. Understood, withdrawing the suggestion.

openshift-ci · 2026-05-27T21:50:36Z

@sg00dwin: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-console	`1a8cda1`	link	true	`/test e2e-aws-console`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 27, 2026

openshift-ci Bot requested review from jhadvig and spadgett May 27, 2026 18:31

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 27, 2026

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

Conversation

sg00dwin commented May 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing

Summary by CodeRabbit

Uh oh!

openshift-ci-robot commented May 27, 2026

Summary

Changes

Testing

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

openshift-ci Bot commented May 27, 2026

Uh oh!

sg00dwin commented May 27, 2026

Uh oh!

openshift-ci-robot commented May 27, 2026

Uh oh!

openshift-ci-robot commented May 27, 2026

Summary

Changes

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sg00dwin May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sg00dwin commented May 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading

coderabbitai Bot May 27, 2026 •

edited

Loading

sg00dwin May 27, 2026 •

edited

Loading