Retry past 2-attempt cap on system-scoped ResourceExhausted#10385
Open
prathyushpv wants to merge 4 commits into
Open
Retry past 2-attempt cap on system-scoped ResourceExhausted#10385prathyushpv wants to merge 4 commits into
prathyushpv wants to merge 4 commits into
Conversation
Add a new dynamic config `system.retryUnboundedOnSystemResourceExhausted` (default false). When enabled, inter-service calls from frontend and CHASM clients to history and matching retry past the standard 2-attempt cap on system-scoped ResourceExhausted errors, bounded only by the retry policy's default 1-minute expiration and the caller's context. Implemented via a new backoff.ConditionalRetryPolicy that delegates to one of two underlying policies based on a per-attempt predicate over the error. Also fix an existing asymmetry in ThrottleRetryContext: a bare type assertion caused wrapped ResourceExhausted errors (e.g., MultiOperationExecution) to bypass the throttle-floor policy. Switched to errors.As so wrapped REs back off consistently with unwrapped ones. Namespace-scoped ResourceExhausted, non-RE transient errors, and matching long-poll calls are unaffected.
…austed The earlier swap from bare type assertion to errors.As was speculative forward-compat with no actual caller benefit: MultiOperationExecution (the only multi-error type in temporal-server) doesn't implement Unwrap(), and nothing else wraps ResourceExhausted with %w. Revert to plain type assertions to match the existing IsServiceClientTransientError style and keep the diff focused on the dynamic-config gate.
The predicate uses a plain type assertion, so wrapped errors aren't a concern in current callers. Remove the test and unused fmt import.
7e0c0d3 to
4a3c575
Compare
yycptt
reviewed
May 30, 2026
Member
yycptt
left a comment
There was a problem hiding this comment.
oh I was think of doing it in the frontend retry interceptor and only one place need to be changed. I don't have strong opinion here and your approach works as well. Just want to have this discussion.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changed?
system.retryUnboundedOnSystemResourceExhaustedglobal dynamic config (defaultfalse). Whentrue, frontend calls to history/matching retry past the 2-attempt cap on system-scopedResourceExhausted, bounded only by the policy's 1-minute expiration and the caller's context.backoff.ConditionalRetryPolicythat delegates to one of two policies based on a per-attempt predicate.Why?
Today's 2-attempt cap surfaces transient capacity events as user-visible failures. We can hide transient system level resource exhausted errors.