OCPBUGS-38676: Add retry for transient API errors to prevent Degraded blips#1164
OCPBUGS-38676: Add retry for transient API errors to prevent Degraded blips#1164sg00dwin wants to merge 1 commit into
Conversation
blips Wrap all API write operations (Apply, Create, Update, Delete) with retry logic to absorb transient errors (conflicts, timeouts, connection refused) during upgrades before they reach status reporting. Without retry, a single transient failure sets Degraded=True for ~1 minute until the next sync resolves it, causing CI test failures. Retry parameters: 3 attempts, 500ms backoff with 2x factor (~3.5s max). Permanent errors (Forbidden, Invalid, AlreadyExists) are not retried. Assisted by Claude Code (Opus 4.6)
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughThis PR adds transient error retry logic for Kubernetes API operations across console controllers and operator sync methods. A new retry utility module with backoff configuration and error classification is introduced, then consistently applied to wrap create, update, apply, and delete operations in nine controllers and eight operator sync functions. ChangesTransient Error Retry Integration
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sg00dwin The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/jira refresh |
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/console/controllers/service/controller.go (1)
144-149:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPropagate redirect sync errors instead of shadowing
svcErr.At Line 144,
redirectSvcErrReason, svcErr := ...shadows the outersvcErr. Line 148 then returns the old outer value, so redirect failures are not returned fromSync.Suggested fix
- if c.serviceName == api.OpenShiftConsoleServiceName { - redirectSvcErrReason, svcErr := c.SyncRedirectService(ctx, routeConfig, controllerContext) - statusHandler.AddConditions(status.HandleProgressingOrDegraded("RedirectServiceSync", redirectSvcErrReason, svcErr)) - } - - return statusHandler.FlushAndReturn(svcErr) + if c.serviceName == api.OpenShiftConsoleServiceName { + redirectSvcErrReason, redirectSvcErr := c.SyncRedirectService(ctx, routeConfig, controllerContext) + statusHandler.AddConditions(status.HandleProgressingOrDegraded("RedirectServiceSync", redirectSvcErrReason, redirectSvcErr)) + if redirectSvcErr != nil { + return statusHandler.FlushAndReturn(redirectSvcErr) + } + } + + return statusHandler.FlushAndReturn(svcErr)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/console/controllers/service/controller.go` around lines 144 - 149, The redirect sync call is shadowing the outer svcErr (redirectSvcErrReason, svcErr := c.SyncRedirectService...), causing the function to return the old svcErr; change the declaration to assign to the existing outer variable (use = instead of :=) or use a distinct temporary name and then set the outer svcErr = the returned error; ensure the call to SyncRedirectService (and the variables redirectSvcErrReason and svcErr) updates the same svcErr that FlushAndReturn uses and keep statusHandler.AddConditions using the updated svcErr.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/console/operator/sync_v400.go`:
- Around line 517-527: The Create retry may succeed server-side then later
return apierrors.IsAlreadyExists, causing SyncServiceCAConfigMap and
SyncTrustedCAConfigMap to surface a create error and flip Degraded; update the
createErr handling after retryOnTransientError (the block that sets actual and
createErr from retryOnTransientError calling
configMapClient.ConfigMaps(...).Create) to detect apierrors.IsAlreadyExists and
in that case call configMapClient.ConfigMaps(...).Get(ctx, required.Name,
metav1.GetOptions{}) to load the existing ConfigMap into actual and clear the
error (return actual, "", nil); apply the same AlreadyExists->Get pattern in
both SyncServiceCAConfigMap and SyncTrustedCAConfigMap so transient post-success
create races do not produce FailedCreate.
- Around line 264-270: The retry logic reuses stale object instances causing
Conflict retries to fail; inside retryOnTransientError closures for
SyncConsoleConfig (UpdateStatus) and for SyncServiceCAConfigMap /
SyncTrustedCAConfigMap (ConfigMaps Update) refetch the latest resource from the
API (not reuse the original updated/existing struct) before calling UpdateStatus
or Update so each attempt uses current resourceVersion; additionally, in the
Create wrappers used in those Sync* functions treat AlreadyExists as success by
fetching the existing object and proceeding (instead of returning FailedCreate),
and keep IsRetryableError behavior but ensure Update/UpdateStatus retries
operate on freshly fetched objects rather than the initial stale instance.
---
Outside diff comments:
In `@pkg/console/controllers/service/controller.go`:
- Around line 144-149: The redirect sync call is shadowing the outer svcErr
(redirectSvcErrReason, svcErr := c.SyncRedirectService...), causing the function
to return the old svcErr; change the declaration to assign to the existing outer
variable (use = instead of :=) or use a distinct temporary name and then set the
outer svcErr = the returned error; ensure the call to SyncRedirectService (and
the variables redirectSvcErrReason and svcErr) updates the same svcErr that
FlushAndReturn uses and keep statusHandler.AddConditions using the updated
svcErr.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 12887e8d-b1d9-4882-b7b9-b035820dd2e0
📒 Files selected for processing (13)
pkg/console/controllers/clidownloads/controller.gopkg/console/controllers/downloadsdeployment/controller.gopkg/console/controllers/oauthclients/oauthclients.gopkg/console/controllers/oauthclientsecret/oauthclientsecret.gopkg/console/controllers/poddisruptionbudget/controller.gopkg/console/controllers/service/controller.gopkg/console/controllers/serviceaccounts/controller.gopkg/console/controllers/storageversionmigration/controller.gopkg/console/controllers/upgradenotification/controller.gopkg/console/controllers/util/retry.gopkg/console/controllers/util/retry_test.gopkg/console/operator/retry.gopkg/console/operator/sync_v400.go
📜 Review details
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{yaml,yml,go}
📄 CodeRabbit inference engine (Custom checks)
Topology-Aware Scheduling Compatibility: Deployment manifests, operator code, and controllers should not introduce scheduling constraints that assume a standard HA topology with 3+ control-plane nodes and dedicated workers. Flag required pod anti-affinity without topology checks, topology spread constraints with DoNotSchedule that exceed schedulable node counts, replica counts derived from node count without considering SNO/TNF/TNA, nodeSelector targeting control-plane nodes, broad tolerations that could schedule to arbiter nodes on TNA, code targeting only worker nodes, or PodDisruptionBudgets designed for 3+ nodes.
Files:
pkg/console/controllers/downloadsdeployment/controller.gopkg/console/controllers/upgradenotification/controller.gopkg/console/controllers/serviceaccounts/controller.gopkg/console/operator/retry.gopkg/console/controllers/storageversionmigration/controller.gopkg/console/controllers/oauthclients/oauthclients.gopkg/console/controllers/service/controller.gopkg/console/controllers/util/retry.gopkg/console/operator/sync_v400.gopkg/console/controllers/clidownloads/controller.gopkg/console/controllers/poddisruptionbudget/controller.gopkg/console/controllers/util/retry_test.gopkg/console/controllers/oauthclientsecret/oauthclientsecret.go
**/*.go
⚙️ CodeRabbit configuration file
**/*.go: Review Go code following OpenShift operator patterns.
See CONVENTIONS.md for coding standards and patterns.Refer to the following skills based on CODE PATTERNS, not just file paths:
Refer to /controller-review when code contains:
- Controller struct types (e.g.,
type *Controller struct)func New*Controller(factory functionsfactory.New().WithFilteredEventsInformers(pattern.ToController(method callsSync(ctx context.Context, controllerContext factory.SyncContext)methodsoperatorConfig.Spec.ManagementStatechecksstatus.NewStatusHandlerorstatus.Handle*functionsRefer to /sync-handler-review when code contains:
- Main operator sync functions (e.g.,
sync_v400.gocontent)- Sequential resource syncing with early returns
- Incremental reconciliation loops
- Multiple
resourceapply.Apply*()calls in sequence- Dependency ordering of ConfigMaps → Secrets → Service Accounts → RBAC → Services → Deployments → Routes
- Feature gate conditional logic
Refer to /go-quality-review for all Go code to check:
- Deprecated imports:
ioutil.ReadFile,ioutil.WriteFile,ioutil.ReadAll- Deprecated patterns:
DialwithoutDialContext- Error handling: missing
%win fmt.Errorf- Code smells: deep nesting (4+ levels), functions >100 lines
- Magic values: unexplained numbers/strings
- Context propagation:
context.Background()instead of passed ctx- Missing godoc on exported functions
Files:
pkg/console/controllers/downloadsdeployment/controller.gopkg/console/controllers/upgradenotification/controller.gopkg/console/controllers/serviceaccounts/controller.gopkg/console/operator/retry.gopkg/console/controllers/storageversionmigration/controller.gopkg/console/controllers/oauthclients/oauthclients.gopkg/console/controllers/service/controller.gopkg/console/controllers/util/retry.gopkg/console/operator/sync_v400.gopkg/console/controllers/clidownloads/controller.gopkg/console/controllers/poddisruptionbudget/controller.gopkg/console/controllers/util/retry_test.gopkg/console/controllers/oauthclientsecret/oauthclientsecret.go
**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}
⚙️ CodeRabbit configuration file
**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}: Injection prevention (prodsec-skills):
- SQL: parameterized queries only; no string concatenation
- Command: no shell=True, os.system, or backtick exec with user input
- LDAP/XPath: escape special characters in filters
- Path traversal: canonicalize paths, reject ../
- Deserialization: no pickle/yaml.load()/eval on untrusted data
- Prototype pollution: no recursive merge of untrusted objects
- Validate at trust boundaries with allow-lists, not deny-lists
- Normalize Unicode and anchor regexes (^$); watch for ReDoS
Files:
pkg/console/controllers/downloadsdeployment/controller.gopkg/console/controllers/upgradenotification/controller.gopkg/console/controllers/serviceaccounts/controller.gopkg/console/operator/retry.gopkg/console/controllers/storageversionmigration/controller.gopkg/console/controllers/oauthclients/oauthclients.gopkg/console/controllers/service/controller.gopkg/console/controllers/util/retry.gopkg/console/operator/sync_v400.gopkg/console/controllers/clidownloads/controller.gopkg/console/controllers/poddisruptionbudget/controller.gopkg/console/controllers/util/retry_test.gopkg/console/controllers/oauthclientsecret/oauthclientsecret.go
**/*_test.go
📄 CodeRabbit inference engine (Custom checks)
**/*_test.go: IPv6 and Disconnected Network Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that contain IPv4 assumptions such as hardcoded IPv4 addresses, IPv4-only parsing logic, IPv4-specific service/endpoint creation, or hardcoded IPv4 network policies. Also flag tests requiring external connectivity to public internet services, pulling images from public registries, or DNS resolution of public hostnames.
MicroShift Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that reference OpenShift-specific APIs unavailable on MicroShift (except Route and SecurityContextConstraints), such as Project/ProjectRequest, Build/BuildConfig, DeploymentConfig, ClusterOperator, ClusterVersion, Etcd operator, CSV/OLM, MachineSet/Machine, ClusterAutoscaler, Console, Monitoring components, ImageRegistry operator, Samples operator, OperatorHub, CloudCredential, or Storage/Network operators.
OTE Binary Stdout Contract: OpenShift Tests Extension (OTE) binaries must communicate with openshift-tests via JSON on stdout only. Flag any stdout writes in process-level code (main(), init(), TestMain(), BeforeSuite(), AfterSuite(), SynchronizedBeforeSuite(), RunSpecs() setup, or top-level var/const initializers). Common violations include klog writing to stdout (must redirect to stderr via klog.SetOutput(os.Stderr) or klog.LogToStderr(true)), and fmt.Print/Println/Printf to stdout in main or suite setup.
Single Node OpenShift (SNO) Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that make multi-node or HA assumptions such as expecting multiple control-plane nodes, multiple worker nodes, pod anti-affinity across nodes, node-to-node communication patterns, leader election failover across replicas, pod rescheduling to different nodes, node scaling operations, or separate node roles.
Stable and Deterministic Te...
Files:
pkg/console/controllers/util/retry_test.go
⚙️ CodeRabbit configuration file
**/*_test.go: Review test code for quality and patterns.Refer to /unit-test-review when test is in pkg//*_test.go:**
- Table-driven test structure with test cases
- Use of
go-test/deepfor struct comparisons- Test naming conventions (TestFunctionName)
- Error handling with
wantErrpattern- Edge case coverage (nil, empty, boundary values)
- Proper assertions with helpful error messages
- Test isolation (no shared mutable state)
Refer to /e2e-test-review when test contains:
framework.MustNewClientset(t, nil)or similar e2e framework usagewait.Pollorwait.PollImmediatepatternsretry.RetryOnConflictfor updates- Cleanup via
deferfunctions- Console/operator CR manipulations
- Test assertions on cluster state
Suggest to use /e2e-test-review when:
- PR adds new feature requiring e2e coverage
- Test file is empty or skeleton
- Comments indicate "TODO: add test"
Review for common issues:
- Missing cleanup (defer statements)
- Using
time.Sleepinstead ofwait.Poll- Missing context timeouts
- Vague error messages in assertions
- Tests without table-driven structure when testing multiple cases
- Ignoring errors with
_- Tests without assertions
Files:
pkg/console/controllers/util/retry_test.go
🔇 Additional comments (12)
pkg/console/controllers/util/retry.go (1)
13-51: LGTM!pkg/console/controllers/util/retry_test.go (1)
11-165: LGTM!pkg/console/controllers/clidownloads/controller.go (1)
231-240: LGTM!Also applies to: 257-266
pkg/console/controllers/downloadsdeployment/controller.go (1)
121-134: LGTM!pkg/console/controllers/oauthclients/oauthclients.go (1)
229-233: LGTM!pkg/console/controllers/oauthclientsecret/oauthclientsecret.go (1)
164-167: LGTM!pkg/console/controllers/poddisruptionbudget/controller.go (1)
100-103: LGTM!pkg/console/controllers/serviceaccounts/controller.go (1)
155-166: LGTM!pkg/console/controllers/storageversionmigration/controller.go (1)
17-17: LGTM!Also applies to: 107-109
pkg/console/controllers/upgradenotification/controller.go (1)
134-147: LGTM!pkg/console/operator/retry.go (1)
8-10: LGTM!pkg/console/operator/sync_v400.go (1)
277-285: LGTM!Also applies to: 323-334, 434-441, 844-849
| var actual *corev1.ConfigMap | ||
| createErr := retryOnTransientError(func() error { | ||
| var e error | ||
| actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{}) | ||
| return e | ||
| }) | ||
| if createErr == nil { | ||
| klog.V(4).Infoln("service-ca configmap created") | ||
| return actual, "", err | ||
| } else { | ||
| return actual, "FailedCreate", err | ||
| return actual, "", nil | ||
| } | ||
| return actual, "FailedCreate", createErr |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify create retry call sites.
rg -n "retryOnTransientError\\(func\\(\\) error \\{[\\s\\S]*?Create\\(" pkg/console/operator/sync_v400.go -U
# Verify how AlreadyExists is classified in retry utility.
rg -n "AlreadyExists|IsAlreadyExists|IsRetryableError" pkg/console/controllers/util/retry.go -C3Repository: openshift/console-operator
Length of output: 11743
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show exact code around the flagged create retry blocks.
sed -n '500,590p' pkg/console/operator/sync_v400.go
# Locate and show retry helper definitions (name + implementation).
rg -n "retryOnTransientError|RetryOnTransientError" pkg/console/operator/sync_v400.go pkg/console/controllers/util/retry.go
sed -n '1,120p' pkg/console/controllers/util/retry.goRepository: openshift/console-operator
Length of output: 5929
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "retryOnTransientError" pkg/console/operator/sync_v400.go pkg/console -SRepository: openshift/console-operator
Length of output: 1718
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,160p' pkg/console/operator/retry.goRepository: openshift/console-operator
Length of output: 276
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "\"FailedCreate\"|FailedCreate" pkg/console/operator -SRepository: openshift/console-operator
Length of output: 233
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "SyncServiceCAConfigMap\\(|SyncTrustedCAConfigMap\\(" pkg/console/operator -SRepository: openshift/console-operator
Length of output: 756
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '120,220p' pkg/console/operator/sync_v400.goRepository: openshift/console-operator
Length of output: 4300
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "func HandleProgressingOrDegraded\\(" -S pkg/console-operator pkg || true
rg -n "HandleProgressingOrDegraded\\(" -S pkg/console/operator pkg/console/controllers pkg | headRepository: openshift/console-operator
Length of output: 2161
🏁 Script executed:
sed -n '80,180p' pkg/console/status/status.goRepository: openshift/console-operator
Length of output: 4251
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "func IsSyncError\\(|IsSyncError\\(" pkg/console -SRepository: openshift/console-operator
Length of output: 205
🏁 Script executed:
sed -n '1,120p' pkg/console/errors/sync_errors.goRepository: openshift/console-operator
Length of output: 920
Avoid Degraded blips when ConfigMap Create succeeded but retry hits AlreadyExists
SyncServiceCAConfigMapandSyncTrustedCAConfigMapretryConfigMaps(...).Create()viaretryOnTransientError;apierrors.IsAlreadyExistsis treated as non-retryable, so a second attempt that returnsAlreadyExistsleads to returning"FailedCreate".- With a transient failure after server-side success, this returns a non-nil error into
status.HandleProgressingOrDegraded(...), settingDegraded=TrueforServiceCASync/TrustedCASynceven though the ConfigMap exists.
Suggested fix pattern
createErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
return e
})
+ if apierrors.IsAlreadyExists(createErr) {
+ actual, createErr = co.configMapClient.ConfigMaps(required.Namespace).Get(ctx, required.Name, metav1.GetOptions{})
+ if createErr == nil {
+ return actual, "", nil
+ }
+ }
if createErr == nil {
return actual, "", nil
}Apply the same AlreadyExists -> Get handling to both the service-ca and trusted-ca Create retries in pkg/console/operator/sync_v400.go.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| var actual *corev1.ConfigMap | |
| createErr := retryOnTransientError(func() error { | |
| var e error | |
| actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{}) | |
| return e | |
| }) | |
| if createErr == nil { | |
| klog.V(4).Infoln("service-ca configmap created") | |
| return actual, "", err | |
| } else { | |
| return actual, "FailedCreate", err | |
| return actual, "", nil | |
| } | |
| return actual, "FailedCreate", createErr | |
| var actual *corev1.ConfigMap | |
| createErr := retryOnTransientError(func() error { | |
| var e error | |
| actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{}) | |
| return e | |
| }) | |
| if apierrors.IsAlreadyExists(createErr) { | |
| actual, createErr = co.configMapClient.ConfigMaps(required.Namespace).Get(ctx, required.Name, metav1.GetOptions{}) | |
| if createErr == nil { | |
| return actual, "", nil | |
| } | |
| } | |
| if createErr == nil { | |
| klog.V(4).Infoln("service-ca configmap created") | |
| return actual, "", nil | |
| } | |
| return actual, "FailedCreate", createErr |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/console/operator/sync_v400.go` around lines 517 - 527, The Create retry
may succeed server-side then later return apierrors.IsAlreadyExists, causing
SyncServiceCAConfigMap and SyncTrustedCAConfigMap to surface a create error and
flip Degraded; update the createErr handling after retryOnTransientError (the
block that sets actual and createErr from retryOnTransientError calling
configMapClient.ConfigMaps(...).Create) to detect apierrors.IsAlreadyExists and
in that case call configMapClient.ConfigMaps(...).Get(ctx, required.Name,
metav1.GetOptions{}) to load the existing ConfigMap into actual and clear the
error (return actual, "", nil); apply the same AlreadyExists->Get pattern in
both SyncServiceCAConfigMap and SyncTrustedCAConfigMap so transient post-success
create races do not produce FailedCreate.
There was a problem hiding this comment.
[Claude] Same reasoning as the reply above — the Create paths here are only reached when the lister returned NotFound. AlreadyExists is already classified as non-retryable in IsRetryableError, so the retry stops immediately on that error with no wasted attempts.
The phantom-success scenario (server-side Create succeeds but client gets a timeout, retry hits AlreadyExists) is a narrow race that self-heals on the next 60-second reconcile. Adding a Get-on-AlreadyExists fallback would add complexity for an edge case that resolves itself within one sync cycle. The original code before this PR had the same behavior — AlreadyExists on Create was always returned as a failure.
There was a problem hiding this comment.
@sg00dwin, fair point — since AlreadyExists is already classified as non-retryable, the retry wrapper exits immediately on that error without looping, and the phantom-success race is both narrow and self-correcting within one reconcile cycle. The added complexity of a Get-on-AlreadyExists fallback isn't justified for such an edge case. Understood, withdrawing the suggestion.
|
@sg00dwin: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
The console ClusterOperator intermittently flips
Degraded=Trueduring upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hitresourceapply.Apply*()calls. The error flows directly to the status handler with no retry, settingDegraded=Truefor ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.
Changes
pkg/console/controllers/util/retry.go— sharedRetryOnTransientError()with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).pkg/console/controllers/util/retry_test.go— 10 unit tests covering error classification and retry behavior.pkg/console/operator/retry.go— thin wrapper for the operator package.sync_v400.go— ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecretTesting
make test-unitpasses cleangofmt/go vetcleanSummary by CodeRabbit
Bug Fixes
Tests