Skip to content

OCPBUGS-38676: Add retry for transient API errors to prevent Degraded blips#1164

Open
sg00dwin wants to merge 1 commit into
openshift:mainfrom
sg00dwin:OCPBUGS-38676-degraded-blips-fix
Open

OCPBUGS-38676: Add retry for transient API errors to prevent Degraded blips#1164
sg00dwin wants to merge 1 commit into
openshift:mainfrom
sg00dwin:OCPBUGS-38676-degraded-blips-fix

Conversation

@sg00dwin
Copy link
Copy Markdown
Member

@sg00dwin sg00dwin commented May 27, 2026

Summary

The console ClusterOperator intermittently flips Degraded=True during upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hit resourceapply.Apply*() calls. The error flows directly to the status handler with no retry, setting Degraded=True for ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).

This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.

Changes

  • New: pkg/console/controllers/util/retry.go — shared RetryOnTransientError() with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).
  • New: pkg/console/controllers/util/retry_test.go — 10 unit tests covering error classification and retry behavior.
  • New: pkg/console/operator/retry.go — thin wrapper for the operator package.
  • Modified: Wrapped 16 API write call sites across 10 files:
    • sync_v400.go — ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecret
    • Individual controllers — PDB, Service, CLIDownloads, DownloadsDeployment, ServiceAccounts, OAuthClients, OAuthClientSecret, UpgradeNotification, StorageVersionMigration
  • Not modified: Healthcheck controller (already has its own retry from OCPBUGS-24041)

Testing

  • All existing unit tests pass
  • New retry utility tests cover: retryable vs permanent error classification, retry-then-succeed, retry exhaustion, generic network error retry
  • make test-unit passes clean
  • gofmt / go vet clean

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced reliability by adding automatic retry logic for transient failures across multiple resource operations (create, update, delete, apply) in Kubernetes API interactions, reducing false failures.
  • Tests

    • Added comprehensive test coverage validating transient error retry behavior and backoff policies.

  blips

  Wrap all API write operations (Apply, Create, Update, Delete) with
  retry logic to absorb transient errors (conflicts, timeouts, connection
  refused) during upgrades before they reach status reporting. Without
  retry, a single transient failure sets Degraded=True for ~1 minute
  until the next sync resolves it, causing CI test failures.

  Retry parameters: 3 attempts, 500ms backoff with 2x factor (~3.5s max).
  Permanent errors (Forbidden, Invalid, AlreadyExists) are not retried.

  Assisted by Claude Code (Opus 4.6)
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 27, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

The console ClusterOperator intermittently flips Degraded=True during upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hit resourceapply.Apply*() calls. The error flows directly to the status handler with no retry, setting Degraded=True for ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).

This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.

Changes

  • New: pkg/console/controllers/util/retry.go — shared RetryOnTransientError() with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).
  • New: pkg/console/controllers/util/retry_test.go — 10 unit tests covering error classification and retry behavior.
  • New: pkg/console/operator/retry.go — thin wrapper for the operator package.
  • Modified: Wrapped 16 API write call sites across 10 files:
  • sync_v400.go — ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecret
  • Individual controllers — PDB, Service, CLIDownloads, DownloadsDeployment, ServiceAccounts, OAuthClients, OAuthClientSecret, UpgradeNotification, StorageVersionMigration
  • Not modified: Healthcheck controller (already has its own retry from OCPBUGS-24041)

Testing

  • All existing unit tests pass
  • New retry utility tests cover: retryable vs permanent error classification, retry-then-succeed, retry exhaustion, generic network error retry
  • make test-unit passes clean
  • gofmt / go vet clean

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Walkthrough

This PR adds transient error retry logic for Kubernetes API operations across console controllers and operator sync methods. A new retry utility module with backoff configuration and error classification is introduced, then consistently applied to wrap create, update, apply, and delete operations in nine controllers and eight operator sync functions.

Changes

Transient Error Retry Integration

Layer / File(s) Summary
Retry infrastructure
pkg/console/controllers/util/retry.go, pkg/console/controllers/util/retry_test.go
Defines TransientBackoff (3 attempts, 500ms base, 2.0 factor), IsRetryableError predicate (retries most errors except Forbidden, Invalid, MethodNotSupported, NotAcceptable, AlreadyExists), and RetryOnTransientError(fn) wrapper that retries with backoff and V(4) logging. Tests validate classification and control flow across scenarios.
Console controller retry wrapping
pkg/console/controllers/clidownloads/controller.go, pkg/console/controllers/downloadsdeployment/controller.go, pkg/console/controllers/oauthclients/oauthclients.go, pkg/console/controllers/oauthclientsecret/oauthclientsecret.go, pkg/console/controllers/poddisruptionbudget/controller.go, pkg/console/controllers/service/controller.go, pkg/console/controllers/serviceaccounts/controller.go, pkg/console/controllers/storageversionmigration/controller.go, pkg/console/controllers/upgradenotification/controller.go
Nine controllers wrap their Kubernetes API operations (create, update, apply, delete) in util.RetryOnTransientError. Each retries on transient failures and returns consistent failure reasons on final failure.
Operator retry helper and sync wrapping
pkg/console/operator/retry.go, pkg/console/operator/sync_v400.go
Operator-local retryOnTransientError wrapper delegates to util.RetryOnTransientError. Eight sync functions (SyncConsoleConfig, SyncConsolePublicConfig, SyncDeployment, SyncConfigMap, SyncServiceCAConfigMap, SyncTrustedCAConfigMap, syncSessionSecret) wrap their Kubernetes API calls with retry logic.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ❓ Inconclusive The custom check specifies "Ginkgo test code" requirements (It blocks, BeforeEach/AfterEach, Eventually/Consistently timeouts), but the PR's test file uses standard Go's testing package, not Ginkgo. Clarify if check applies to standard Go tests. If yes, retry_test.go properly uses table-driven tests with descriptive messages and covers scenarios via t.Run sub-tests; no changes needed.
✅ Passed checks (13 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly references the Jira issue (OCPBUGS-38676) and accurately summarizes the main change: adding retry logic for transient API errors to prevent Degraded status blips.
Description check ✅ Passed The PR description is comprehensive and covers the root cause, solution, testing, and implementation details. While not formatted as the template suggests, it provides all essential information needed for review.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in the PR are stable and deterministic. No dynamic content, pod names, timestamps, UUIDs, namespaces with suffixes, or fmt.Sprintf found in test titles.
Microshift Test Compatibility ✅ Passed This PR does not add any Ginkgo e2e tests. The only new test file (retry_test.go) is a standard Go unit test with testing.T, not a Ginkgo test.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR adds no Ginkgo e2e tests; only standard Go unit tests (testing.T) for retry utility in pkg/console/controllers/util/retry_test.go, which are not SNO-specific.
Topology-Aware Scheduling Compatibility ✅ Passed PR only adds retry logic for transient API errors. No scheduling constraints, affinity rules, topology constraints, or topology-related changes are introduced.
Ote Binary Stdout Contract ✅ Passed RetryOnTransientError only called from controller methods. Uses klog.V(4).Infof() which writes to STDERR, not STDOUT. No fmt.Print or os.Stdout writes found.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. All test files (including the new retry_test.go) use standard Go testing package. Check not applicable.
No-Weak-Crypto ✅ Passed No weak cryptography detected: no MD5/SHA1/DES/RC4 usage, no custom crypto implementations, and string comparison used for config reconciliation (not authentication). Uses secure crypto/rand.
Container-Privileges ✅ Passed PR modifies only Go controller code, no container/K8s manifests changed. Existing deployments have secure configs with allowPrivilegeEscalation: false and no privileged settings.
No-Sensitive-Data-In-Logs ✅ Passed PR adds one logging statement for Kubernetes API errors at V(4) level in retry.go. API errors contain only status/messages, not request body or credentials.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from jhadvig and spadgett May 27, 2026 18:31
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sg00dwin
Once this PR has been reviewed and has the lgtm label, please assign jhadvig for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sg00dwin
Copy link
Copy Markdown
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 27, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

The console ClusterOperator intermittently flips Degraded=True during upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hit resourceapply.Apply*() calls. The error flows directly to the status handler with no retry, setting Degraded=True for ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).

This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.

Changes

  • New: pkg/console/controllers/util/retry.go — shared RetryOnTransientError() with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).
  • New: pkg/console/controllers/util/retry_test.go — 10 unit tests covering error classification and retry behavior.
  • New: pkg/console/operator/retry.go — thin wrapper for the operator package.
  • Modified: Wrapped 16 API write call sites across 10 files:
  • sync_v400.go — ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecret
  • Individual controllers — PDB, Service, CLIDownloads, DownloadsDeployment, ServiceAccounts, OAuthClients, OAuthClientSecret, UpgradeNotification, StorageVersionMigration
  • Not modified: Healthcheck controller (already has its own retry from OCPBUGS-24041)

Testing

  • All existing unit tests pass
  • New retry utility tests cover: retryable vs permanent error classification, retry-then-succeed, retry exhaustion, generic network error retry
  • make test-unit passes clean
  • gofmt / go vet clean

Summary by CodeRabbit

  • Bug Fixes

  • Enhanced reliability by adding automatic retry logic for transient failures across multiple resource operations (create, update, delete, apply) in Kubernetes API interactions, reducing false failures.

  • Tests

  • Added comprehensive test coverage validating transient error retry behavior and backoff policies.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/console/controllers/service/controller.go (1)

144-149: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate redirect sync errors instead of shadowing svcErr.

At Line 144, redirectSvcErrReason, svcErr := ... shadows the outer svcErr. Line 148 then returns the old outer value, so redirect failures are not returned from Sync.

Suggested fix
-	if c.serviceName == api.OpenShiftConsoleServiceName {
-		redirectSvcErrReason, svcErr := c.SyncRedirectService(ctx, routeConfig, controllerContext)
-		statusHandler.AddConditions(status.HandleProgressingOrDegraded("RedirectServiceSync", redirectSvcErrReason, svcErr))
-	}
-
-	return statusHandler.FlushAndReturn(svcErr)
+	if c.serviceName == api.OpenShiftConsoleServiceName {
+		redirectSvcErrReason, redirectSvcErr := c.SyncRedirectService(ctx, routeConfig, controllerContext)
+		statusHandler.AddConditions(status.HandleProgressingOrDegraded("RedirectServiceSync", redirectSvcErrReason, redirectSvcErr))
+		if redirectSvcErr != nil {
+			return statusHandler.FlushAndReturn(redirectSvcErr)
+		}
+	}
+
+	return statusHandler.FlushAndReturn(svcErr)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/console/controllers/service/controller.go` around lines 144 - 149, The
redirect sync call is shadowing the outer svcErr (redirectSvcErrReason, svcErr
:= c.SyncRedirectService...), causing the function to return the old svcErr;
change the declaration to assign to the existing outer variable (use = instead
of :=) or use a distinct temporary name and then set the outer svcErr = the
returned error; ensure the call to SyncRedirectService (and the variables
redirectSvcErrReason and svcErr) updates the same svcErr that FlushAndReturn
uses and keep statusHandler.AddConditions using the updated svcErr.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/console/operator/sync_v400.go`:
- Around line 517-527: The Create retry may succeed server-side then later
return apierrors.IsAlreadyExists, causing SyncServiceCAConfigMap and
SyncTrustedCAConfigMap to surface a create error and flip Degraded; update the
createErr handling after retryOnTransientError (the block that sets actual and
createErr from retryOnTransientError calling
configMapClient.ConfigMaps(...).Create) to detect apierrors.IsAlreadyExists and
in that case call configMapClient.ConfigMaps(...).Get(ctx, required.Name,
metav1.GetOptions{}) to load the existing ConfigMap into actual and clear the
error (return actual, "", nil); apply the same AlreadyExists->Get pattern in
both SyncServiceCAConfigMap and SyncTrustedCAConfigMap so transient post-success
create races do not produce FailedCreate.
- Around line 264-270: The retry logic reuses stale object instances causing
Conflict retries to fail; inside retryOnTransientError closures for
SyncConsoleConfig (UpdateStatus) and for SyncServiceCAConfigMap /
SyncTrustedCAConfigMap (ConfigMaps Update) refetch the latest resource from the
API (not reuse the original updated/existing struct) before calling UpdateStatus
or Update so each attempt uses current resourceVersion; additionally, in the
Create wrappers used in those Sync* functions treat AlreadyExists as success by
fetching the existing object and proceeding (instead of returning FailedCreate),
and keep IsRetryableError behavior but ensure Update/UpdateStatus retries
operate on freshly fetched objects rather than the initial stale instance.

---

Outside diff comments:
In `@pkg/console/controllers/service/controller.go`:
- Around line 144-149: The redirect sync call is shadowing the outer svcErr
(redirectSvcErrReason, svcErr := c.SyncRedirectService...), causing the function
to return the old svcErr; change the declaration to assign to the existing outer
variable (use = instead of :=) or use a distinct temporary name and then set the
outer svcErr = the returned error; ensure the call to SyncRedirectService (and
the variables redirectSvcErrReason and svcErr) updates the same svcErr that
FlushAndReturn uses and keep statusHandler.AddConditions using the updated
svcErr.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 12887e8d-b1d9-4882-b7b9-b035820dd2e0

📥 Commits

Reviewing files that changed from the base of the PR and between c54e288 and 1a8cda1.

📒 Files selected for processing (13)
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/operator/retry.go
  • pkg/console/operator/sync_v400.go
📜 Review details
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{yaml,yml,go}

📄 CodeRabbit inference engine (Custom checks)

Topology-Aware Scheduling Compatibility: Deployment manifests, operator code, and controllers should not introduce scheduling constraints that assume a standard HA topology with 3+ control-plane nodes and dedicated workers. Flag required pod anti-affinity without topology checks, topology spread constraints with DoNotSchedule that exceed schedulable node counts, replica counts derived from node count without considering SNO/TNF/TNA, nodeSelector targeting control-plane nodes, broad tolerations that could schedule to arbiter nodes on TNA, code targeting only worker nodes, or PodDisruptionBudgets designed for 3+ nodes.

Files:

  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/operator/retry.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
**/*.go

⚙️ CodeRabbit configuration file

**/*.go: Review Go code following OpenShift operator patterns.
See CONVENTIONS.md for coding standards and patterns.

Refer to the following skills based on CODE PATTERNS, not just file paths:

Refer to /controller-review when code contains:

  • Controller struct types (e.g., type *Controller struct)
  • func New*Controller( factory functions
  • factory.New().WithFilteredEventsInformers( pattern
  • .ToController( method calls
  • Sync(ctx context.Context, controllerContext factory.SyncContext) methods
  • operatorConfig.Spec.ManagementState checks
  • status.NewStatusHandler or status.Handle* functions

Refer to /sync-handler-review when code contains:

  • Main operator sync functions (e.g., sync_v400.go content)
  • Sequential resource syncing with early returns
  • Incremental reconciliation loops
  • Multiple resourceapply.Apply*() calls in sequence
  • Dependency ordering of ConfigMaps → Secrets → Service Accounts → RBAC → Services → Deployments → Routes
  • Feature gate conditional logic

Refer to /go-quality-review for all Go code to check:

  • Deprecated imports: ioutil.ReadFile, ioutil.WriteFile, ioutil.ReadAll
  • Deprecated patterns: Dial without DialContext
  • Error handling: missing %w in fmt.Errorf
  • Code smells: deep nesting (4+ levels), functions >100 lines
  • Magic values: unexplained numbers/strings
  • Context propagation: context.Background() instead of passed ctx
  • Missing godoc on exported functions

Files:

  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/operator/retry.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}

⚙️ CodeRabbit configuration file

**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}: Injection prevention (prodsec-skills):

  • SQL: parameterized queries only; no string concatenation
  • Command: no shell=True, os.system, or backtick exec with user input
  • LDAP/XPath: escape special characters in filters
  • Path traversal: canonicalize paths, reject ../
  • Deserialization: no pickle/yaml.load()/eval on untrusted data
  • Prototype pollution: no recursive merge of untrusted objects
  • Validate at trust boundaries with allow-lists, not deny-lists
  • Normalize Unicode and anchor regexes (^$); watch for ReDoS

Files:

  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/operator/retry.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
**/*_test.go

📄 CodeRabbit inference engine (Custom checks)

**/*_test.go: IPv6 and Disconnected Network Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that contain IPv4 assumptions such as hardcoded IPv4 addresses, IPv4-only parsing logic, IPv4-specific service/endpoint creation, or hardcoded IPv4 network policies. Also flag tests requiring external connectivity to public internet services, pulling images from public registries, or DNS resolution of public hostnames.
MicroShift Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that reference OpenShift-specific APIs unavailable on MicroShift (except Route and SecurityContextConstraints), such as Project/ProjectRequest, Build/BuildConfig, DeploymentConfig, ClusterOperator, ClusterVersion, Etcd operator, CSV/OLM, MachineSet/Machine, ClusterAutoscaler, Console, Monitoring components, ImageRegistry operator, Samples operator, OperatorHub, CloudCredential, or Storage/Network operators.
OTE Binary Stdout Contract: OpenShift Tests Extension (OTE) binaries must communicate with openshift-tests via JSON on stdout only. Flag any stdout writes in process-level code (main(), init(), TestMain(), BeforeSuite(), AfterSuite(), SynchronizedBeforeSuite(), RunSpecs() setup, or top-level var/const initializers). Common violations include klog writing to stdout (must redirect to stderr via klog.SetOutput(os.Stderr) or klog.LogToStderr(true)), and fmt.Print/Println/Printf to stdout in main or suite setup.
Single Node OpenShift (SNO) Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that make multi-node or HA assumptions such as expecting multiple control-plane nodes, multiple worker nodes, pod anti-affinity across nodes, node-to-node communication patterns, leader election failover across replicas, pod rescheduling to different nodes, node scaling operations, or separate node roles.
Stable and Deterministic Te...

Files:

  • pkg/console/controllers/util/retry_test.go

⚙️ CodeRabbit configuration file

**/*_test.go: Review test code for quality and patterns.

Refer to /unit-test-review when test is in pkg//*_test.go:**

  • Table-driven test structure with test cases
  • Use of go-test/deep for struct comparisons
  • Test naming conventions (TestFunctionName)
  • Error handling with wantErr pattern
  • Edge case coverage (nil, empty, boundary values)
  • Proper assertions with helpful error messages
  • Test isolation (no shared mutable state)

Refer to /e2e-test-review when test contains:

  • framework.MustNewClientset(t, nil) or similar e2e framework usage
  • wait.Poll or wait.PollImmediate patterns
  • retry.RetryOnConflict for updates
  • Cleanup via defer functions
  • Console/operator CR manipulations
  • Test assertions on cluster state

Suggest to use /e2e-test-review when:

  • PR adds new feature requiring e2e coverage
  • Test file is empty or skeleton
  • Comments indicate "TODO: add test"

Review for common issues:

  • Missing cleanup (defer statements)
  • Using time.Sleep instead of wait.Poll
  • Missing context timeouts
  • Vague error messages in assertions
  • Tests without table-driven structure when testing multiple cases
  • Ignoring errors with _
  • Tests without assertions

Files:

  • pkg/console/controllers/util/retry_test.go
🔇 Additional comments (12)
pkg/console/controllers/util/retry.go (1)

13-51: LGTM!

pkg/console/controllers/util/retry_test.go (1)

11-165: LGTM!

pkg/console/controllers/clidownloads/controller.go (1)

231-240: LGTM!

Also applies to: 257-266

pkg/console/controllers/downloadsdeployment/controller.go (1)

121-134: LGTM!

pkg/console/controllers/oauthclients/oauthclients.go (1)

229-233: LGTM!

pkg/console/controllers/oauthclientsecret/oauthclientsecret.go (1)

164-167: LGTM!

pkg/console/controllers/poddisruptionbudget/controller.go (1)

100-103: LGTM!

pkg/console/controllers/serviceaccounts/controller.go (1)

155-166: LGTM!

pkg/console/controllers/storageversionmigration/controller.go (1)

17-17: LGTM!

Also applies to: 107-109

pkg/console/controllers/upgradenotification/controller.go (1)

134-147: LGTM!

pkg/console/operator/retry.go (1)

8-10: LGTM!

pkg/console/operator/sync_v400.go (1)

277-285: LGTM!

Also applies to: 323-334, 434-441, 844-849

Comment thread pkg/console/operator/sync_v400.go
Comment on lines +517 to +527
var actual *corev1.ConfigMap
createErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
return e
})
if createErr == nil {
klog.V(4).Infoln("service-ca configmap created")
return actual, "", err
} else {
return actual, "FailedCreate", err
return actual, "", nil
}
return actual, "FailedCreate", createErr
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify create retry call sites.
rg -n "retryOnTransientError\\(func\\(\\) error \\{[\\s\\S]*?Create\\(" pkg/console/operator/sync_v400.go -U

# Verify how AlreadyExists is classified in retry utility.
rg -n "AlreadyExists|IsAlreadyExists|IsRetryableError" pkg/console/controllers/util/retry.go -C3

Repository: openshift/console-operator

Length of output: 11743


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show exact code around the flagged create retry blocks.
sed -n '500,590p' pkg/console/operator/sync_v400.go

# Locate and show retry helper definitions (name + implementation).
rg -n "retryOnTransientError|RetryOnTransientError" pkg/console/operator/sync_v400.go pkg/console/controllers/util/retry.go
sed -n '1,120p' pkg/console/controllers/util/retry.go

Repository: openshift/console-operator

Length of output: 5929


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "retryOnTransientError" pkg/console/operator/sync_v400.go pkg/console -S

Repository: openshift/console-operator

Length of output: 1718


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1,160p' pkg/console/operator/retry.go

Repository: openshift/console-operator

Length of output: 276


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "\"FailedCreate\"|FailedCreate" pkg/console/operator -S

Repository: openshift/console-operator

Length of output: 233


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "SyncServiceCAConfigMap\\(|SyncTrustedCAConfigMap\\(" pkg/console/operator -S

Repository: openshift/console-operator

Length of output: 756


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '120,220p' pkg/console/operator/sync_v400.go

Repository: openshift/console-operator

Length of output: 4300


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "func HandleProgressingOrDegraded\\(" -S pkg/console-operator pkg || true
rg -n "HandleProgressingOrDegraded\\(" -S pkg/console/operator pkg/console/controllers pkg | head

Repository: openshift/console-operator

Length of output: 2161


🏁 Script executed:

sed -n '80,180p' pkg/console/status/status.go

Repository: openshift/console-operator

Length of output: 4251


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "func IsSyncError\\(|IsSyncError\\(" pkg/console -S

Repository: openshift/console-operator

Length of output: 205


🏁 Script executed:

sed -n '1,120p' pkg/console/errors/sync_errors.go

Repository: openshift/console-operator

Length of output: 920


Avoid Degraded blips when ConfigMap Create succeeded but retry hits AlreadyExists

  • SyncServiceCAConfigMap and SyncTrustedCAConfigMap retry ConfigMaps(...).Create() via retryOnTransientError; apierrors.IsAlreadyExists is treated as non-retryable, so a second attempt that returns AlreadyExists leads to returning "FailedCreate".
  • With a transient failure after server-side success, this returns a non-nil error into status.HandleProgressingOrDegraded(...), setting Degraded=True for ServiceCASync/TrustedCASync even though the ConfigMap exists.
Suggested fix pattern
  createErr := retryOnTransientError(func() error {
    var e error
    actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
    return e
  })
+ if apierrors.IsAlreadyExists(createErr) {
+   actual, createErr = co.configMapClient.ConfigMaps(required.Namespace).Get(ctx, required.Name, metav1.GetOptions{})
+   if createErr == nil {
+     return actual, "", nil
+   }
+ }
  if createErr == nil {
    return actual, "", nil
  }

Apply the same AlreadyExists -> Get handling to both the service-ca and trusted-ca Create retries in pkg/console/operator/sync_v400.go.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var actual *corev1.ConfigMap
createErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
return e
})
if createErr == nil {
klog.V(4).Infoln("service-ca configmap created")
return actual, "", err
} else {
return actual, "FailedCreate", err
return actual, "", nil
}
return actual, "FailedCreate", createErr
var actual *corev1.ConfigMap
createErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
return e
})
if apierrors.IsAlreadyExists(createErr) {
actual, createErr = co.configMapClient.ConfigMaps(required.Namespace).Get(ctx, required.Name, metav1.GetOptions{})
if createErr == nil {
return actual, "", nil
}
}
if createErr == nil {
klog.V(4).Infoln("service-ca configmap created")
return actual, "", nil
}
return actual, "FailedCreate", createErr
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/console/operator/sync_v400.go` around lines 517 - 527, The Create retry
may succeed server-side then later return apierrors.IsAlreadyExists, causing
SyncServiceCAConfigMap and SyncTrustedCAConfigMap to surface a create error and
flip Degraded; update the createErr handling after retryOnTransientError (the
block that sets actual and createErr from retryOnTransientError calling
configMapClient.ConfigMaps(...).Create) to detect apierrors.IsAlreadyExists and
in that case call configMapClient.ConfigMaps(...).Get(ctx, required.Name,
metav1.GetOptions{}) to load the existing ConfigMap into actual and clear the
error (return actual, "", nil); apply the same AlreadyExists->Get pattern in
both SyncServiceCAConfigMap and SyncTrustedCAConfigMap so transient post-success
create races do not produce FailedCreate.

Copy link
Copy Markdown
Member Author

@sg00dwin sg00dwin May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] Same reasoning as the reply above — the Create paths here are only reached when the lister returned NotFound. AlreadyExists is already classified as non-retryable in IsRetryableError, so the retry stops immediately on that error with no wasted attempts.

The phantom-success scenario (server-side Create succeeds but client gets a timeout, retry hits AlreadyExists) is a narrow race that self-heals on the next 60-second reconcile. Adding a Get-on-AlreadyExists fallback would add complexity for an edge case that resolves itself within one sync cycle. The original code before this PR had the same behavior — AlreadyExists on Create was always returned as a failure.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sg00dwin, fair point — since AlreadyExists is already classified as non-retryable, the retry wrapper exits immediately on that error without looping, and the phantom-success race is both narrow and self-correcting within one reconcile cycle. The added complexity of a Get-on-AlreadyExists fallback isn't justified for such an edge case. Understood, withdrawing the suggestion.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

@sg00dwin: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-console 1a8cda1 link true /test e2e-aws-console

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants