OCPBUGS-86221: Update the well-known readiness controller to properly set cluster-operator conditions#901
Conversation
|
@everettraven: This pull request references Jira Issue OCPBUGS-86221, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughThis PR replaces the progressing-error model with ControllerDegradationObservedError, refactors well-known readiness to inject CA data and aggregate per-IP endpoint checks returning (available, error), updates controller status wiring to emit degradation-observed conditions for partial failures, and adds TLS-based tests exercising endpoint health and sync condition behavior. ChangesController Degradation-Observed Error Model
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@everettraven: This pull request references Jira Issue OCPBUGS-86221, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/controllers/readiness/wellknown_ready_controller_test.go (1)
51-52:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winHandle
http.ResponseWriter.Writeerrors instead of discarding them.Line 51 and Line 222 ignore
Writeerrors, which violates the Go guideline and can mask handler failures in tests.Proposed fix
func (trt *testServer) ServeHTTP(w http.ResponseWriter, req *http.Request) { if trt.responseDelay > 0 { time.Sleep(trt.responseDelay) } w.WriteHeader(trt.sendStatus) - w.Write(trt.sendData) + if _, err := w.Write(trt.sendData); err != nil { + return + } } @@ func (h *sequentialHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { h.mu.Lock() resp := h.responses[h.idx] h.idx++ h.mu.Unlock() w.WriteHeader(resp.status) - w.Write(resp.body) + if _, err := w.Write(resp.body); err != nil { + return + } }As per coding guidelines,
**/*.go: Go security (prodsec-skills): "Never ignore error returns".Also applies to: 221-223
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controllers/readiness/wellknown_ready_controller_test.go` around lines 51 - 52, The test currently discards the error returned by http.ResponseWriter.Write when writing trt.sendData; update the test to capture that error and fail the test on non-nil (e.g., if _, err := w.Write(trt.sendData); err != nil { t.Fatalf("write failed: %v", err) } or use require.NoError(t, err)). Locate the occurrences that write trt.sendData (the Write call in the readiness test helper and the other similar write) and replace the ignored writes with error-checked writes so any handler write failures cause the test to fail.
🧹 Nitpick comments (2)
pkg/controllers/readiness/wellknown_ready_controller_test.go (1)
216-220: ⚡ Quick winGuard
sequentialHandleragainst out-of-range response indexing.Line 218 can panic when requests exceed configured responses (e.g., retry behavior changes), making tests brittle.
Proposed fix
func (h *sequentialHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { h.mu.Lock() - resp := h.responses[h.idx] - h.idx++ + if len(h.responses) == 0 { + h.mu.Unlock() + w.WriteHeader(http.StatusInternalServerError) + return + } + resp := h.responses[len(h.responses)-1] + if h.idx < len(h.responses) { + resp = h.responses[h.idx] + h.idx++ + } h.mu.Unlock() w.WriteHeader(resp.status) if _, err := w.Write(resp.body); err != nil { return } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controllers/readiness/wellknown_ready_controller_test.go` around lines 216 - 220, The ServeHTTP method on sequentialHandler currently indexes h.responses by h.idx without bounds checking which can panic; update ServeHTTP to acquire the mutex, check if h.idx is >= len(h.responses) (or len==0) before accessing, and if out-of-range either cap h.idx to len(h.responses)-1 (so the last configured response is reused) or return a safe default response (e.g., 500) — then increment or leave the index appropriately; make this change inside the sequentialHandler.ServeHTTP function to prevent panics when requests exceed configured responses.pkg/controllers/readiness/wellknown_ready_controller.go (1)
294-307: ⚡ Quick win
defer cancel()inside loop accumulates deferred calls.The
defer cancel()on line 296 runs inside the retry loop, causing deferred cancellations to accumulate until the function returns. Callcancel()explicitly after each iteration or use a closure.♻️ Proposed fix
for i := 0; i < 3; i++ { reqCtx, cancel := context.WithTimeout(ctx, 5*time.Second) - defer cancel() req := req.WithContext(reqCtx) resp, err = rt.RoundTrip(req) if err != nil { connError = fmt.Errorf("failed to GET kube-apiserver oauth endpoint %s: %w%s", wellKnown, err, wellKnownRoundtripErrorHint(err)) klog.V(2).Info(connError) + cancel() continue } connError = nil - defer resp.Body.Close() + cancel() break } + if resp != nil { + defer resp.Body.Close() + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controllers/readiness/wellknown_ready_controller.go` around lines 294 - 307, The loop creates a context with `reqCtx, cancel := context.WithTimeout(ctx, 5*time.Second)` and uses `defer cancel()` inside the retry loop causing deferred cancels to accumulate; change this by removing `defer cancel()` and calling `cancel()` explicitly at the end of each iteration (or wrap the loop body in an anonymous func so `defer cancel()` is scoped per-iteration) where `reqCtx` and `cancel` are created (symbols: `reqCtx`, `cancel`, `req.WithContext(reqCtx)`, `rt.RoundTrip(req)`) to ensure the timeout context is released immediately after each roundtrip attempt.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/controllers/readiness/error.go`:
- Around line 45-53: The IsDegraded method on ControllerDegradationObservedError
doesn't guard against a nil lastStatus and will panic when dereferencing
lastStatus.Conditions; update ControllerDegradationObservedError.IsDegraded to
return false immediately if the lastStatus parameter is nil before calling
v1helpers.FindOperatorCondition(…) or using
ControllerDegradationObservedConditionName(controllerName), ensuring callers
that pass a nil operator status are safely handled.
In `@pkg/controllers/readiness/wellknown_ready_controller.go`:
- Around line 195-203: The current type assertion on err for
*ControllerDegradationObservedError can miss wrapped errors (e.g., errors.Join),
so replace the direct assertion with errors.As to unwrap and find a
*ControllerDegradationObservedError instance; then call
degradationErr.IsDegraded(controllerName, operatorStatus) and return
degradationErr.Unwrap() when degraded, otherwise set degradationObserved =
degradationErr.ToCondition(controllerName) and return nil, preserving the
existing logic that used Unwrap, ToCondition, controllerName, operatorStatus and
degradationObserved.
---
Outside diff comments:
In `@pkg/controllers/readiness/wellknown_ready_controller_test.go`:
- Around line 51-52: The test currently discards the error returned by
http.ResponseWriter.Write when writing trt.sendData; update the test to capture
that error and fail the test on non-nil (e.g., if _, err :=
w.Write(trt.sendData); err != nil { t.Fatalf("write failed: %v", err) } or use
require.NoError(t, err)). Locate the occurrences that write trt.sendData (the
Write call in the readiness test helper and the other similar write) and replace
the ignored writes with error-checked writes so any handler write failures cause
the test to fail.
---
Nitpick comments:
In `@pkg/controllers/readiness/wellknown_ready_controller_test.go`:
- Around line 216-220: The ServeHTTP method on sequentialHandler currently
indexes h.responses by h.idx without bounds checking which can panic; update
ServeHTTP to acquire the mutex, check if h.idx is >= len(h.responses) (or
len==0) before accessing, and if out-of-range either cap h.idx to
len(h.responses)-1 (so the last configured response is reused) or return a safe
default response (e.g., 500) — then increment or leave the index appropriately;
make this change inside the sequentialHandler.ServeHTTP function to prevent
panics when requests exceed configured responses.
In `@pkg/controllers/readiness/wellknown_ready_controller.go`:
- Around line 294-307: The loop creates a context with `reqCtx, cancel :=
context.WithTimeout(ctx, 5*time.Second)` and uses `defer cancel()` inside the
retry loop causing deferred cancels to accumulate; change this by removing
`defer cancel()` and calling `cancel()` explicitly at the end of each iteration
(or wrap the loop body in an anonymous func so `defer cancel()` is scoped
per-iteration) where `reqCtx` and `cancel` are created (symbols: `reqCtx`,
`cancel`, `req.WithContext(reqCtx)`, `rt.RoundTrip(req)`) to ensure the timeout
context is released immediately after each roundtrip attempt.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1f880a7b-0e29-40d2-9a3c-6aab5fde03d6
📒 Files selected for processing (3)
pkg/controllers/readiness/error.gopkg/controllers/readiness/wellknown_ready_controller.gopkg/controllers/readiness/wellknown_ready_controller_test.go
There was a problem hiding this comment.
nit: Would it be worth replacing ioutil with io since the former is deprecated, since we're touching this file?
There was a problem hiding this comment.
Done in the latest commit
| } | ||
| degradationObserved = degradationObserved. | ||
| WithStatus(operatorv1.ConditionTrue) | ||
| if progressingErr, ok := err.(*common.ControllerProgressingError); ok { |
There was a problem hiding this comment.
nit: common.ControllerProgressingError doesn't look like it's used anymore; maybe worth cleaning up?
There was a problem hiding this comment.
Ah, good catch. I didn't even check with it being in the common package - just assumed something else would be using it.
There was a problem hiding this comment.
Done in the latest push.
| available := applyoperatorv1.OperatorCondition(). | ||
| WithType("WellKnownAvailable") | ||
| progressing := applyoperatorv1.OperatorCondition(). | ||
| WithType(common.ControllerProgressingConditionName(controllerName)) |
There was a problem hiding this comment.
nit: common.ControllerProgressingConditionName doesn't look like it's used anymore; maybe worth cleaning up?
There was a problem hiding this comment.
Done in the latest push.
| return false, errors.Join(errs...) | ||
| } | ||
|
|
||
| // TODO(everettraven): verify that this is no longer necessary. |
There was a problem hiding this comment.
Are you planning in verifying this in the scope of this PR, or at a later stage? To me it also looks like this is no longer necessary; if I understand the alreadyTrueOnce guard correctly, we won't need it any more because we are now going available when at least one endpoint is healthy, and if some are still rolling out, we're capturing that with the new partial-failure model.
If you're planning on revisiting this after this PR merges, maybe it'd be better to drop the code already and open an issue to track the verification.
There was a problem hiding this comment.
I'd like to verify this within the scope of this PR by running some payload jobs to verify that this change won't trigger any regressions in existing tests (once I tackle additional work to take care of existing CI failures here)
b097b41 to
455035e
Compare
…DegradationObserved' so that the WellKnownReadyController no longer uses the 'Progressing' suffixed condition type that causes the overall cluster operator 'Progressing' condition to be set to True. This prevents the WellKnownReadyController from flapping the overall 'Progressing' condition because it does not actually have any operands. Instead, to maintain the eventual setting of the Degraded condition, we replace it with WellKnownReadyControllerDegradationObserved and monitor that condition state to determine if we should flip the overall cluster operator to Degraded after some time. Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
…ast one well-known endpoint is available so that we align with the Available condition semantics that define availability as being accessible but can also be in a degraded state. Availability should not flap based on temporary degradation like a node going down for a reboot. Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
…ting implementation Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
…tsReady and sync methods. Implementation done/assisted by Claude Code and reviewed for accurracy/correctness by @everettraven Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
Signed-off-by: Bryce Palmer <bpalmer@redhat.com>
455035e to
dbdad61
Compare
|
/testwith openshift/cluster-authentication-operator/e2e-agnostic openshift/origin#31236 |
|
@everettraven, |
|
/testwith openshift/cluster-authentication-operator/master/e2e-agnostic openshift/origin#31236 |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/controllers/readiness/wellknown_ready_controller.go (1)
195-203:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse
errors.Asinstead of direct type assertion for degradation-observed errors.
err.(*ControllerDegradationObservedError)misses wrapped/joined cases (Line 255), so maxAge-based degradation gating can be skipped when all endpoint errors are joined. Useerrors.Asbefore deciding immediate return.Suggested fix
- if degradationErr, ok := err.(*ControllerDegradationObservedError); ok { + var degradationErr *ControllerDegradationObservedError + if errors.As(err, °radationErr) { if degradationErr.IsDegraded(controllerName, operatorStatus) { return degradationErr.Unwrap() } degradationObserved = degradationErr.ToCondition(controllerName) return nilAlso applies to: 254-256, 368-368
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controllers/readiness/wellknown_ready_controller.go` around lines 195 - 203, The code currently uses a direct type assertion err.(*ControllerDegradationObservedError) which fails to detect wrapped instances; replace those assertions with errors.As to properly handle wrapped/joined errors: in the blocks that reference ControllerDegradationObservedError (the branches that call IsDegraded(controllerName, operatorStatus), ToCondition(controllerName) and Unwrap()), use var degradationErr *ControllerDegradationObservedError; if errors.As(err, °radationErr) { ... } else { return err } so IsDegraded, ToCondition and Unwrap are invoked on the extracted degradationErr and degradationObserved is set the same way; apply this change to the three occurrences that currently use the direct type assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/controllers/readiness/wellknown_ready_controller_test.go`:
- Around line 216-220: The ServeHTTP implementation in sequentialHandler reads
h.responses[h.idx] without bounds checking which can panic if more requests
arrive than entries; update ServeHTTP (method ServeHTTP on type
sequentialHandler) to guard the index by checking h.idx against len(h.responses)
while holding h.mu and use a fallback response (e.g., the last response or a
default) when h.idx is out of range, incrementing/clamping h.idx appropriately
to keep behavior deterministic and avoid panics under extra requests.
---
Duplicate comments:
In `@pkg/controllers/readiness/wellknown_ready_controller.go`:
- Around line 195-203: The code currently uses a direct type assertion
err.(*ControllerDegradationObservedError) which fails to detect wrapped
instances; replace those assertions with errors.As to properly handle
wrapped/joined errors: in the blocks that reference
ControllerDegradationObservedError (the branches that call
IsDegraded(controllerName, operatorStatus), ToCondition(controllerName) and
Unwrap()), use var degradationErr *ControllerDegradationObservedError; if
errors.As(err, °radationErr) { ... } else { return err } so IsDegraded,
ToCondition and Unwrap are invoked on the extracted degradationErr and
degradationObserved is set the same way; apply this change to the three
occurrences that currently use the direct type assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3c2cb4a6-0db0-44b2-9b69-09e785682a43
📒 Files selected for processing (6)
pkg/controllers/common/conditions.gopkg/controllers/readiness/error.gopkg/controllers/readiness/error_test.gopkg/controllers/readiness/wellknown_ready_controller.gopkg/controllers/readiness/wellknown_ready_controller_test.gotest/e2e-oidc/external_oidc_test.go
💤 Files with no reviewable changes (1)
- pkg/controllers/common/conditions.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/controllers/readiness/error.go
| func (h *sequentialHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { | ||
| h.mu.Lock() | ||
| resp := h.responses[h.idx] | ||
| h.idx++ | ||
| h.mu.Unlock() |
There was a problem hiding this comment.
Guard sequentialHandler response index to avoid panic.
If the handler gets more requests than responses, h.responses[h.idx] panics. Add a bounds fallback to keep tests deterministic.
Suggested fix
func (h *sequentialHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
h.mu.Lock()
- resp := h.responses[h.idx]
- h.idx++
+ idx := h.idx
+ if idx >= len(h.responses) {
+ idx = len(h.responses) - 1
+ }
+ resp := h.responses[idx]
+ if h.idx < len(h.responses) {
+ h.idx++
+ }
h.mu.Unlock()
w.WriteHeader(resp.status)
if _, err := w.Write(resp.body); err != nil {
return
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func (h *sequentialHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { | |
| h.mu.Lock() | |
| resp := h.responses[h.idx] | |
| h.idx++ | |
| h.mu.Unlock() | |
| func (h *sequentialHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { | |
| h.mu.Lock() | |
| idx := h.idx | |
| if idx >= len(h.responses) { | |
| idx = len(h.responses) - 1 | |
| } | |
| resp := h.responses[idx] | |
| if h.idx < len(h.responses) { | |
| h.idx++ | |
| } | |
| h.mu.Unlock() | |
| w.WriteHeader(resp.status) | |
| if _, err := w.Write(resp.body); err != nil { | |
| return | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/controllers/readiness/wellknown_ready_controller_test.go` around lines
216 - 220, The ServeHTTP implementation in sequentialHandler reads
h.responses[h.idx] without bounds checking which can panic if more requests
arrive than entries; update ServeHTTP (method ServeHTTP on type
sequentialHandler) to guard the index by checking h.idx against len(h.responses)
while holding h.mu and use a fallback response (e.g., the last response or a
default) when h.idx is out of range, incrementing/clamping h.idx appropriately
to keep behavior deterministic and avoid panics under extra requests.
|
@everettraven: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
As part of investigating OCPBUGS-86221 it was found that the likely culprit was the well-known readiness controller setting
Progressing=Truewhen a kube-apiserver endpoint was not yet accessible during a node reboot being done by the MCO.This PR attempts to resolve this by aligning the status conditions that are set by this controller to better align with the status condition type meanings as outlined in https://github.com/openshift/api/blob/09730f85d8835712c0412a585685ddadb01e171b/config/v1/types_cluster_operator.go#L151.
To facilitate this, the following changes were made:
Availablegets set toTruewhen any of the kube-apiservers returns a valid response for the well-known endpoint.Progressingis replaced byDegradationObservedwhen any of the kube-apiservers is not sufficiently serving their well-known endpoint. This controller has no operands, makingProgressinga condition it should not be setting.Degradedis set after 5 minutes of a givenDegradationObservedstate continuing to be observed.To ensure the behavior is working as expected, additional unit tests were added to exercise this new behavior.
Summary by CodeRabbit
New Features
Bug Fixes / Behavior Changes
Tests