Skip to content

OCPBUGS-86221: Update the well-known readiness controller to properly set cluster-operator conditions#901

Open
everettraven wants to merge 7 commits into
openshift:masterfrom
everettraven:bugfix/wellknownready-status
Open

OCPBUGS-86221: Update the well-known readiness controller to properly set cluster-operator conditions#901
everettraven wants to merge 7 commits into
openshift:masterfrom
everettraven:bugfix/wellknownready-status

Conversation

@everettraven
Copy link
Copy Markdown
Contributor

@everettraven everettraven commented May 27, 2026

As part of investigating OCPBUGS-86221 it was found that the likely culprit was the well-known readiness controller setting Progressing=True when a kube-apiserver endpoint was not yet accessible during a node reboot being done by the MCO.

This PR attempts to resolve this by aligning the status conditions that are set by this controller to better align with the status condition type meanings as outlined in https://github.com/openshift/api/blob/09730f85d8835712c0412a585685ddadb01e171b/config/v1/types_cluster_operator.go#L151.

To facilitate this, the following changes were made:

  • Available gets set to True when any of the kube-apiservers returns a valid response for the well-known endpoint.
  • Progressing is replaced by DegradationObserved when any of the kube-apiservers is not sufficiently serving their well-known endpoint. This controller has no operands, making Progressing a condition it should not be setting.
  • Degraded is set after 5 minutes of a given DegradationObserved state continuing to be observed.

To ensure the behavior is working as expected, additional unit tests were added to exercise this new behavior.

Summary by CodeRabbit

  • New Features

    • Introduced time-based controller degradation tracking and reporting for clearer health visibility
    • Refined well-known endpoint readiness checks with improved aggregation and error classification
  • Bug Fixes / Behavior Changes

    • Replaced previous "progressing" condition with a "degradation observed" condition for readiness reporting
  • Tests

    • Added comprehensive unit and end-to-end tests covering endpoint scenarios and degradation timeouts

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 27, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@everettraven: This pull request references Jira Issue OCPBUGS-86221, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

As part of investigating OCPBUGS-86221 it was found that the likely culprit was the well-known readiness controller setting Progressing=True when a kube-apiserver endpoint was not yet accessible during a node reboot being done by the MCO.

This PR attempts to resolve this by aligning the status conditions that are set by this controller to better align with the status condition type meanings as outlined in https://github.com/openshift/api/blob/09730f85d8835712c0412a585685ddadb01e171b/config/v1/types_cluster_operator.go#L151.

To facilitate this, the following changes were made:

  • Available gets set to True when any of the kube-apiservers returns a valid response for the well-known endpoint.
  • Progressing is replaced by DegradationObserved when any of the kube-apiservers is not sufficiently serving their well-known endpoint. This controller has no operands, making Progressing a condition it should not be setting.
  • Degraded is set after 5 minutes of a given DegradationObserved state continuing to be observed.

To ensure the behavior is working as expected, additional unit tests were added to exercise this new behavior.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Walkthrough

This PR replaces the progressing-error model with ControllerDegradationObservedError, refactors well-known readiness to inject CA data and aggregate per-IP endpoint checks returning (available, error), updates controller status wiring to emit degradation-observed conditions for partial failures, and adds TLS-based tests exercising endpoint health and sync condition behavior.

Changes

Controller Degradation-Observed Error Model

Layer / File(s) Summary
Remove ControllerProgressingError
pkg/controllers/common/conditions.go
Deletes the old ControllerProgressingError type and related helpers; removes the time import.
ControllerDegradationObservedError type definition and tests
pkg/controllers/readiness/error.go, pkg/controllers/readiness/error_test.go
Adds ControllerDegradationObservedError with reason, wrapped error, maxAge, Error()/Unwrap(), ToCondition(), IsDegraded(), and ControllerDegradationObservedConditionName(); updates unit tests to validate degraded timing logic.
Dependency injection and wiring
pkg/controllers/readiness/wellknown_ready_controller.go
Introduces oidcAvailabler and caDataGetter interfaces, updates controller struct and constructor to accept injected auth checker and default CA-data getter.
isWellknownEndpointsReady endpoint checking refactor
pkg/controllers/readiness/wellknown_ready_controller.go
Rewrites isWellknownEndpointsReady to return (bool, error), use injected CA data, build TLS transport, check each API server IP, aggregate per-IP errors with errors.Join, and return degradation-observed on partial failures.
Sync method integration with degradation conditions
pkg/controllers/readiness/wellknown_ready_controller.go
Updates deferred operator status to apply WellKnownAvailable plus the degradation-observed condition; consumes (available, err) results, unwraps/constructs ControllerDegradationObservedError into conditions, and classifies partial failures as available-with-degradation.
Test infrastructure for TLS endpoint testing
pkg/controllers/readiness/wellknown_ready_controller_test.go
Adds crypto/TLS helpers, fake CA-data getter, sequential HTTP handler, TLS test server bootstrap, and Service/Endpoints fixtures for HTTPS endpoint tests.
Endpoint readiness and sync test coverage
pkg/controllers/readiness/wellknown_ready_controller_test.go, test/e2e-oidc/external_oidc_test.go
Adds Test_wellKnownReadyController_isWellknownEndpointsReady and Test_wellKnownReadyController_sync table-driven tests covering healthy, partial, total failure, and stale-metadata scenarios; updates e2e expected condition name to WellKnownReadyControllerDegradationObserved.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Tests lack context timeouts on operations that interact with the cluster (sync, endpoint checks) and sequentialHandler has unguarded index access that can panic with extra requests. Add context.WithTimeout() wrapping around context.Background() in both test functions; fix sequentialHandler.ServeHTTP() to guard against index out of bounds as suggested in review comments.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main change: updating the well-known readiness controller to set proper cluster-operator conditions, moving from Progressing to DegradationObserved semantics. The title is specific, concise, and directly reflects the core objective.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names are static and deterministic without dynamic values, UUIDs, timestamps, IPs, or generated suffixes.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests were added. PR contains only standard Go unit tests and modifies existing e2e tests using the testing package, not Ginkgo.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests were added. PR contains standard Go unit tests and updates to existing tests only.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only Go controller logic with no deployment manifests or scheduling constraints. No pod affinity, anti-affinity, topology spread constraints, or topology-dependent logic introduced.
Ote Binary Stdout Contract ✅ Passed PR does not modify OTE binary or packages it imports. Changed files are controller code and non-OTE tests with no process-level stdout writes.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests were added in this PR. Tests added use standard Go testing package, not Ginkgo, so check does not apply.
No-Weak-Crypto ✅ Passed No weak crypto algorithms found. Uses only modern crypto: ECDSA P256, RSA-256, TLS, X.509. No custom crypto or insecure secret comparisons.
Container-Privileges ✅ Passed PR modifies only Go source files; no Kubernetes or container manifest files changed, so container-privilege check is not applicable.
No-Sensitive-Data-In-Logs ✅ Passed No passwords, tokens, API keys, PII, or other sensitive data are exposed in logs. Internal cluster IPs in diagnostic messages are standard operator practice.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from flavianmissi and p0lyn0mial May 27, 2026 17:20
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gangwgr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@everettraven: This pull request references Jira Issue OCPBUGS-86221, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

As part of investigating OCPBUGS-86221 it was found that the likely culprit was the well-known readiness controller setting Progressing=True when a kube-apiserver endpoint was not yet accessible during a node reboot being done by the MCO.

This PR attempts to resolve this by aligning the status conditions that are set by this controller to better align with the status condition type meanings as outlined in https://github.com/openshift/api/blob/09730f85d8835712c0412a585685ddadb01e171b/config/v1/types_cluster_operator.go#L151.

To facilitate this, the following changes were made:

  • Available gets set to True when any of the kube-apiservers returns a valid response for the well-known endpoint.
  • Progressing is replaced by DegradationObserved when any of the kube-apiservers is not sufficiently serving their well-known endpoint. This controller has no operands, making Progressing a condition it should not be setting.
  • Degraded is set after 5 minutes of a given DegradationObserved state continuing to be observed.

To ensure the behavior is working as expected, additional unit tests were added to exercise this new behavior.

Summary by CodeRabbit

  • New Features

  • Enhanced controller degradation tracking with time-based monitoring windows

  • Improved operator condition reporting for better visibility into controller health status

  • Refined well-known endpoint readiness checks with comprehensive error handling

  • Tests

  • Added comprehensive end-to-end controller readiness testing scenarios

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/controllers/readiness/wellknown_ready_controller_test.go (1)

51-52: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle http.ResponseWriter.Write errors instead of discarding them.

Line 51 and Line 222 ignore Write errors, which violates the Go guideline and can mask handler failures in tests.

Proposed fix
 func (trt *testServer) ServeHTTP(w http.ResponseWriter, req *http.Request) {
 	if trt.responseDelay > 0 {
 		time.Sleep(trt.responseDelay)
 	}

 	w.WriteHeader(trt.sendStatus)
-	w.Write(trt.sendData)
+	if _, err := w.Write(trt.sendData); err != nil {
+		return
+	}
 }
@@
 func (h *sequentialHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
 	h.mu.Lock()
 	resp := h.responses[h.idx]
 	h.idx++
 	h.mu.Unlock()
 	w.WriteHeader(resp.status)
-	w.Write(resp.body)
+	if _, err := w.Write(resp.body); err != nil {
+		return
+	}
 }

As per coding guidelines, **/*.go: Go security (prodsec-skills): "Never ignore error returns".

Also applies to: 221-223

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/readiness/wellknown_ready_controller_test.go` around lines 51
- 52, The test currently discards the error returned by
http.ResponseWriter.Write when writing trt.sendData; update the test to capture
that error and fail the test on non-nil (e.g., if _, err :=
w.Write(trt.sendData); err != nil { t.Fatalf("write failed: %v", err) } or use
require.NoError(t, err)). Locate the occurrences that write trt.sendData (the
Write call in the readiness test helper and the other similar write) and replace
the ignored writes with error-checked writes so any handler write failures cause
the test to fail.
🧹 Nitpick comments (2)
pkg/controllers/readiness/wellknown_ready_controller_test.go (1)

216-220: ⚡ Quick win

Guard sequentialHandler against out-of-range response indexing.

Line 218 can panic when requests exceed configured responses (e.g., retry behavior changes), making tests brittle.

Proposed fix
 func (h *sequentialHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
 	h.mu.Lock()
-	resp := h.responses[h.idx]
-	h.idx++
+	if len(h.responses) == 0 {
+		h.mu.Unlock()
+		w.WriteHeader(http.StatusInternalServerError)
+		return
+	}
+	resp := h.responses[len(h.responses)-1]
+	if h.idx < len(h.responses) {
+		resp = h.responses[h.idx]
+		h.idx++
+	}
 	h.mu.Unlock()
 	w.WriteHeader(resp.status)
 	if _, err := w.Write(resp.body); err != nil {
 		return
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/readiness/wellknown_ready_controller_test.go` around lines
216 - 220, The ServeHTTP method on sequentialHandler currently indexes
h.responses by h.idx without bounds checking which can panic; update ServeHTTP
to acquire the mutex, check if h.idx is >= len(h.responses) (or len==0) before
accessing, and if out-of-range either cap h.idx to len(h.responses)-1 (so the
last configured response is reused) or return a safe default response (e.g.,
500) — then increment or leave the index appropriately; make this change inside
the sequentialHandler.ServeHTTP function to prevent panics when requests exceed
configured responses.
pkg/controllers/readiness/wellknown_ready_controller.go (1)

294-307: ⚡ Quick win

defer cancel() inside loop accumulates deferred calls.

The defer cancel() on line 296 runs inside the retry loop, causing deferred cancellations to accumulate until the function returns. Call cancel() explicitly after each iteration or use a closure.

♻️ Proposed fix
 	for i := 0; i < 3; i++ {
 		reqCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
-		defer cancel()
 
 		req := req.WithContext(reqCtx)
 		resp, err = rt.RoundTrip(req)
 		if err != nil {
 			connError = fmt.Errorf("failed to GET kube-apiserver oauth endpoint %s: %w%s", wellKnown, err, wellKnownRoundtripErrorHint(err))
 			klog.V(2).Info(connError)
+			cancel()
 			continue
 		}
 		connError = nil
-		defer resp.Body.Close()
+		cancel()
 		break
 	}
+	if resp != nil {
+		defer resp.Body.Close()
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/readiness/wellknown_ready_controller.go` around lines 294 -
307, The loop creates a context with `reqCtx, cancel := context.WithTimeout(ctx,
5*time.Second)` and uses `defer cancel()` inside the retry loop causing deferred
cancels to accumulate; change this by removing `defer cancel()` and calling
`cancel()` explicitly at the end of each iteration (or wrap the loop body in an
anonymous func so `defer cancel()` is scoped per-iteration) where `reqCtx` and
`cancel` are created (symbols: `reqCtx`, `cancel`, `req.WithContext(reqCtx)`,
`rt.RoundTrip(req)`) to ensure the timeout context is released immediately after
each roundtrip attempt.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controllers/readiness/error.go`:
- Around line 45-53: The IsDegraded method on ControllerDegradationObservedError
doesn't guard against a nil lastStatus and will panic when dereferencing
lastStatus.Conditions; update ControllerDegradationObservedError.IsDegraded to
return false immediately if the lastStatus parameter is nil before calling
v1helpers.FindOperatorCondition(…) or using
ControllerDegradationObservedConditionName(controllerName), ensuring callers
that pass a nil operator status are safely handled.

In `@pkg/controllers/readiness/wellknown_ready_controller.go`:
- Around line 195-203: The current type assertion on err for
*ControllerDegradationObservedError can miss wrapped errors (e.g., errors.Join),
so replace the direct assertion with errors.As to unwrap and find a
*ControllerDegradationObservedError instance; then call
degradationErr.IsDegraded(controllerName, operatorStatus) and return
degradationErr.Unwrap() when degraded, otherwise set degradationObserved =
degradationErr.ToCondition(controllerName) and return nil, preserving the
existing logic that used Unwrap, ToCondition, controllerName, operatorStatus and
degradationObserved.

---

Outside diff comments:
In `@pkg/controllers/readiness/wellknown_ready_controller_test.go`:
- Around line 51-52: The test currently discards the error returned by
http.ResponseWriter.Write when writing trt.sendData; update the test to capture
that error and fail the test on non-nil (e.g., if _, err :=
w.Write(trt.sendData); err != nil { t.Fatalf("write failed: %v", err) } or use
require.NoError(t, err)). Locate the occurrences that write trt.sendData (the
Write call in the readiness test helper and the other similar write) and replace
the ignored writes with error-checked writes so any handler write failures cause
the test to fail.

---

Nitpick comments:
In `@pkg/controllers/readiness/wellknown_ready_controller_test.go`:
- Around line 216-220: The ServeHTTP method on sequentialHandler currently
indexes h.responses by h.idx without bounds checking which can panic; update
ServeHTTP to acquire the mutex, check if h.idx is >= len(h.responses) (or
len==0) before accessing, and if out-of-range either cap h.idx to
len(h.responses)-1 (so the last configured response is reused) or return a safe
default response (e.g., 500) — then increment or leave the index appropriately;
make this change inside the sequentialHandler.ServeHTTP function to prevent
panics when requests exceed configured responses.

In `@pkg/controllers/readiness/wellknown_ready_controller.go`:
- Around line 294-307: The loop creates a context with `reqCtx, cancel :=
context.WithTimeout(ctx, 5*time.Second)` and uses `defer cancel()` inside the
retry loop causing deferred cancels to accumulate; change this by removing
`defer cancel()` and calling `cancel()` explicitly at the end of each iteration
(or wrap the loop body in an anonymous func so `defer cancel()` is scoped
per-iteration) where `reqCtx` and `cancel` are created (symbols: `reqCtx`,
`cancel`, `req.WithContext(reqCtx)`, `rt.RoundTrip(req)`) to ensure the timeout
context is released immediately after each roundtrip attempt.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f880a7b-0e29-40d2-9a3c-6aab5fde03d6

📥 Commits

Reviewing files that changed from the base of the PR and between 67a3b6a and 9f9865b.

📒 Files selected for processing (3)
  • pkg/controllers/readiness/error.go
  • pkg/controllers/readiness/wellknown_ready_controller.go
  • pkg/controllers/readiness/wellknown_ready_controller_test.go

Comment thread pkg/controllers/readiness/error.go
Comment thread pkg/controllers/readiness/wellknown_ready_controller.go
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Would it be worth replacing ioutil with io since the former is deprecated, since we're touching this file?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the latest commit

}
degradationObserved = degradationObserved.
WithStatus(operatorv1.ConditionTrue)
if progressingErr, ok := err.(*common.ControllerProgressingError); ok {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: common.ControllerProgressingError doesn't look like it's used anymore; maybe worth cleaning up?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch. I didn't even check with it being in the common package - just assumed something else would be using it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the latest push.

available := applyoperatorv1.OperatorCondition().
WithType("WellKnownAvailable")
progressing := applyoperatorv1.OperatorCondition().
WithType(common.ControllerProgressingConditionName(controllerName))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: common.ControllerProgressingConditionName doesn't look like it's used anymore; maybe worth cleaning up?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the latest push.

return false, errors.Join(errs...)
}

// TODO(everettraven): verify that this is no longer necessary.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning in verifying this in the scope of this PR, or at a later stage? To me it also looks like this is no longer necessary; if I understand the alreadyTrueOnce guard correctly, we won't need it any more because we are now going available when at least one endpoint is healthy, and if some are still rolling out, we're capturing that with the new partial-failure model.

If you're planning on revisiting this after this PR merges, maybe it'd be better to drop the code already and open an issue to track the verification.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to verify this within the scope of this PR by running some payload jobs to verify that this change won't trigger any regressions in existing tests (once I tackle additional work to take care of existing CI failures here)

…DegradationObserved'

so that the WellKnownReadyController no longer uses the 'Progressing' suffixed
condition type that causes the overall cluster operator 'Progressing' condition to
be set to True. This prevents the WellKnownReadyController from flapping the overall
'Progressing' condition because it does not actually have any operands.
Instead, to maintain the eventual setting of the Degraded condition, we replace it
with WellKnownReadyControllerDegradationObserved and monitor that condition state
to determine if we should flip the overall cluster operator to Degraded after some time.

Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
…ast one well-known endpoint is available

so that we align with the Available condition semantics that define availability
as being accessible but can also be in a degraded state. Availability should not
flap based on temporary degradation like a node going down for a reboot.

Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
…ting implementation

Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
…tsReady and sync methods.

Implementation done/assisted by Claude Code and reviewed for accurracy/correctness by @everettraven

Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
@everettraven everettraven force-pushed the bugfix/wellknownready-status branch from 455035e to dbdad61 Compare May 29, 2026 19:30
@everettraven
Copy link
Copy Markdown
Contributor Author

/testwith openshift/cluster-authentication-operator/e2e-agnostic openshift/origin#31236

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 29, 2026

@everettraven, testwith: Error processing request. ERROR:

could not determine job runs: requested job is invalid. needs to be formatted like: <org>/<repo>/<branch>/<variant?>/<job>. instead it was: openshift/cluster-authentication-operator/e2e-agnostic

@everettraven
Copy link
Copy Markdown
Contributor Author

/testwith openshift/cluster-authentication-operator/master/e2e-agnostic openshift/origin#31236

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/controllers/readiness/wellknown_ready_controller.go (1)

195-203: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use errors.As instead of direct type assertion for degradation-observed errors.

err.(*ControllerDegradationObservedError) misses wrapped/joined cases (Line 255), so maxAge-based degradation gating can be skipped when all endpoint errors are joined. Use errors.As before deciding immediate return.

Suggested fix
-		if degradationErr, ok := err.(*ControllerDegradationObservedError); ok {
+		var degradationErr *ControllerDegradationObservedError
+		if errors.As(err, &degradationErr) {
 			if degradationErr.IsDegraded(controllerName, operatorStatus) {
 				return degradationErr.Unwrap()
 			}
 			degradationObserved = degradationErr.ToCondition(controllerName)
 			return nil

Also applies to: 254-256, 368-368

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/readiness/wellknown_ready_controller.go` around lines 195 -
203, The code currently uses a direct type assertion
err.(*ControllerDegradationObservedError) which fails to detect wrapped
instances; replace those assertions with errors.As to properly handle
wrapped/joined errors: in the blocks that reference
ControllerDegradationObservedError (the branches that call
IsDegraded(controllerName, operatorStatus), ToCondition(controllerName) and
Unwrap()), use var degradationErr *ControllerDegradationObservedError; if
errors.As(err, &degradationErr) { ... } else { return err } so IsDegraded,
ToCondition and Unwrap are invoked on the extracted degradationErr and
degradationObserved is set the same way; apply this change to the three
occurrences that currently use the direct type assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controllers/readiness/wellknown_ready_controller_test.go`:
- Around line 216-220: The ServeHTTP implementation in sequentialHandler reads
h.responses[h.idx] without bounds checking which can panic if more requests
arrive than entries; update ServeHTTP (method ServeHTTP on type
sequentialHandler) to guard the index by checking h.idx against len(h.responses)
while holding h.mu and use a fallback response (e.g., the last response or a
default) when h.idx is out of range, incrementing/clamping h.idx appropriately
to keep behavior deterministic and avoid panics under extra requests.

---

Duplicate comments:
In `@pkg/controllers/readiness/wellknown_ready_controller.go`:
- Around line 195-203: The code currently uses a direct type assertion
err.(*ControllerDegradationObservedError) which fails to detect wrapped
instances; replace those assertions with errors.As to properly handle
wrapped/joined errors: in the blocks that reference
ControllerDegradationObservedError (the branches that call
IsDegraded(controllerName, operatorStatus), ToCondition(controllerName) and
Unwrap()), use var degradationErr *ControllerDegradationObservedError; if
errors.As(err, &degradationErr) { ... } else { return err } so IsDegraded,
ToCondition and Unwrap are invoked on the extracted degradationErr and
degradationObserved is set the same way; apply this change to the three
occurrences that currently use the direct type assertion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3c2cb4a6-0db0-44b2-9b69-09e785682a43

📥 Commits

Reviewing files that changed from the base of the PR and between 9f9865b and dbdad61.

📒 Files selected for processing (6)
  • pkg/controllers/common/conditions.go
  • pkg/controllers/readiness/error.go
  • pkg/controllers/readiness/error_test.go
  • pkg/controllers/readiness/wellknown_ready_controller.go
  • pkg/controllers/readiness/wellknown_ready_controller_test.go
  • test/e2e-oidc/external_oidc_test.go
💤 Files with no reviewable changes (1)
  • pkg/controllers/common/conditions.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/controllers/readiness/error.go

Comment on lines +216 to +220
func (h *sequentialHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
h.mu.Lock()
resp := h.responses[h.idx]
h.idx++
h.mu.Unlock()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard sequentialHandler response index to avoid panic.

If the handler gets more requests than responses, h.responses[h.idx] panics. Add a bounds fallback to keep tests deterministic.

Suggested fix
 func (h *sequentialHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
 	h.mu.Lock()
-	resp := h.responses[h.idx]
-	h.idx++
+	idx := h.idx
+	if idx >= len(h.responses) {
+		idx = len(h.responses) - 1
+	}
+	resp := h.responses[idx]
+	if h.idx < len(h.responses) {
+		h.idx++
+	}
 	h.mu.Unlock()
 	w.WriteHeader(resp.status)
 	if _, err := w.Write(resp.body); err != nil {
 		return
 	}
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (h *sequentialHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
h.mu.Lock()
resp := h.responses[h.idx]
h.idx++
h.mu.Unlock()
func (h *sequentialHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
h.mu.Lock()
idx := h.idx
if idx >= len(h.responses) {
idx = len(h.responses) - 1
}
resp := h.responses[idx]
if h.idx < len(h.responses) {
h.idx++
}
h.mu.Unlock()
w.WriteHeader(resp.status)
if _, err := w.Write(resp.body); err != nil {
return
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/readiness/wellknown_ready_controller_test.go` around lines
216 - 220, The ServeHTTP implementation in sequentialHandler reads
h.responses[h.idx] without bounds checking which can panic if more requests
arrive than entries; update ServeHTTP (method ServeHTTP on type
sequentialHandler) to guard the index by checking h.idx against len(h.responses)
while holding h.mu and use a fallback response (e.g., the last response or a
default) when h.idx is out of range, incrementing/clamping h.idx appropriately
to keep behavior deterministic and avoid panics under extra requests.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 29, 2026

@everettraven: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-agnostic dbdad61 link true /test e2e-agnostic

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants