🐛 fix: (boxcutter) Enable archival and spec changes for ProgressDeadlineExceeded revisions#2610
Conversation
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Updates the ClusterObjectSet controller’s update-event predicate so that once a ClusterObjectSet hits ProgressDeadlineExceeded, spec-driven updates (identified via metadata.generation changes) are still reconciled while status-only updates remain suppressed—preventing resources from getting stuck and unable to transition (e.g., to Archived).
Changes:
- Adjust
skipProgressDeadlineExceededPredicateto allow updates whenObjectOld.Generation != ObjectNew.Generation. - Preserve the existing behavior of suppressing status-only update churn after
ProgressDeadlineExceeded.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterobjectset_controller.go
Outdated
Show resolved
Hide resolved
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2610 +/- ##
==========================================
+ Coverage 68.95% 69.00% +0.05%
==========================================
Files 139 139
Lines 9891 9920 +29
==========================================
+ Hits 6820 6845 +25
- Misses 2562 2564 +2
- Partials 509 511 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
6b78b23 to
426beb1
Compare
| func (c *ClusterObjectSetReconciler) SetupWithManager(mgr ctrl.Manager) error { | ||
| skipProgressDeadlineExceededPredicate := predicate.Funcs{ | ||
| UpdateFunc: func(e event.UpdateEvent) bool { | ||
| rev, ok := e.ObjectNew.(*ocv1.ClusterObjectSet) |
There was a problem hiding this comment.
I would extract the whole update func into standalone function, not just the logic under shouldAllowProgressDeadlineExceededUpdate
There was a problem hiding this comment.
@pedjak
I think we might can remove the predicate.
See the current proposal now.
WDYT? I tried to update the PR desc as well
426beb1 to
59eaf3d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Maybe I misunderstand the progress deadline exceeded stuff. But I thought exceeding the deadline literally only meant that we set Progressing=False? But otherwise all other functionality and decision making around reconciliation of the CE works the same as it does prior to the deadline exceeding. If that's not the case, why? |
59eaf3d to
b2da731
Compare
b2da731 to
bf9e524
Compare
|
Hi @joelanford I changed the proposal here. I think it might be a better way to solve it. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterobjectset_controller.go
Outdated
Show resolved
Hide resolved
internal/operator-controller/controllers/clusterobjectset_controller_internal_test.go
Outdated
Show resolved
Hide resolved
bf9e524 to
38e7024
Compare
pedjak
left a comment
There was a problem hiding this comment.
Three suggestions:
-
Add a comment explaining the
ReasonSucceededcarve-out. The guard usesreason != ReasonSucceeded, which means any new reason added in the future is silently blocked by default. That's the safe default, but it's a hidden constraint — a short comment explaining why onlySucceededis exempt would prevent a future contributor from accidentally breaking this. -
Add a debug log in the early return. When the guard fires, it silently swallows a state transition with zero signal. A
V(1)log line like"skipping markAsProgressing: ProgressDeadlineExceeded is sticky"would help debugging when someone is troubleshooting why a COS condition isn't updating. -
Consider an integration test for the archival scenario (follow-up). The unit tests cover
markAsProgressingwell, but the motivating bug (archival blocked after deadline exceeded) isn't tested e2e. A test simulating deadline exceeded →lifecycleState: Archived→ verify reconcile proceeds would cover the actual user-facing scenario.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterobjectset_controller_test.go
Show resolved
Hide resolved
5039200 to
d2e5ecb
Compare
d2e5ecb to
d881e80
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Show resolved
Hide resolved
592dcd2 to
7452c0b
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Outdated
Show resolved
Hide resolved
d840c14 to
d2dba24
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go:1
- The digest is sensitive to ordering in slices/arrays (e.g.,
Channels,MatchExpressions) even when different orderings may be semantically equivalent. This can cause unnecessary “spec changed → re-resolve” behavior. Consider canonicalizing inputs before hashing (e.g., sortChannels, and convertmetav1.LabelSelectorto a normalized string form viametav1.LabelSelectorAsSelector(...).String()or explicit sorting of requirements) to reduce unintended churn.
/*
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Show resolved
Hide resolved
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Show resolved
Hide resolved
2fa9f67 to
96d0dc1
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Outdated
Show resolved
Hide resolved
internal/operator-controller/controllers/clusterextension_reconcile_steps.go
Show resolved
Hide resolved
This commit addresses the reconcile loop and archival timeout issues when ProgressDeadlineExceeded is set on a ClusterObjectSet. Generated-By: Claude
96d0dc1 to
a4ce4b3
Compare
How it works on main
The COS reconciler checks if a revision took too long to roll out. If it did, it sets
Progressing=False/ProgressDeadlineExceeded.To avoid a reconcile loop after that,
mainhas a watch predicate that blocks allupdates to COS objects with ProgressDeadlineExceeded. The predicate only allows
deletions through.
What is the problem?
The predicate blocks too much. It blocks all updates, not just status updates.
Example: a new revision rolls out and tries to archive the old stuck one by patching
lifecycleState: Archived. The predicate sees ProgressDeadlineExceeded and drops theevent. The old revision never gets archived.
Example scenario
ProgressDeadlineMinutes, the reconciler setsProgressing=False/ProgressDeadlineExceeded.lifecycleState: Archivedso the old revision gets cleaned up.
COS-rev-1 never reconciles, never processes the archival, and stays stuck forever.
Why does the reconcile loop happen?
Every time the reconciler runs on a deadline-exceeded COS, it calls
markAsProgressing()which sets
Progressing=True. Then the deadline check immediately sets it back toProgressing=False/ProgressDeadlineExceeded. This back-and-forth produces a statusupdate, which triggers another reconcile, and so on.
The predicate was added to break this loop. But it also breaks archiving.
What is the fix?
1. Make ProgressDeadlineExceeded sticky (prevents reconcile loop)
Once ProgressDeadlineExceeded is set, markAsProgressing() will not overwrite
it with RollingOut for Active revisions. This prevents the status flip-flop
that caused the reconcile loop.
The guard only blocks RollingOut. Other reasons (Succeeded, Retrying) can
still update the condition because they represent meaningful state changes.
With no flip-flop, there are no unnecessary status updates, so no reconcile
loop. The watch predicate is removed.
2. Skip progress deadline enforcement for Archived revisions
The deadline check now skips Archived revisions:
This allows Archived revisions to retry teardown operations and update their
status without being overridden by the deadline check.
3. Detect spec changes during rollout (resolution digest)
A SHA256 digest of all catalog filter inputs (packageName, version, channels,
selector, upgradeConstraintPolicy) is stored in the revision's
olm.operatorframework.io/resolution-digest annotation.
Before reusing a rolling-out revision, the controller compares the current
spec's digest with the revision's digest. If they differ, a new revision is
created.
This ensures spec changes are not ignored when a revision has
ProgressDeadlineExceeded.
How it works
ProgressDeadlineExceeded
Motivation
Addresses feedback from:
openshift/operator-framework-operator-controller#687 (comment)