Skip to content

Retry past 2-attempt cap on system-scoped ResourceExhausted#10385

Open
prathyushpv wants to merge 4 commits into
mainfrom
ppv/retry-system-resource-exhausted
Open

Retry past 2-attempt cap on system-scoped ResourceExhausted#10385
prathyushpv wants to merge 4 commits into
mainfrom
ppv/retry-system-resource-exhausted

Conversation

@prathyushpv
Copy link
Copy Markdown
Contributor

@prathyushpv prathyushpv commented May 25, 2026

What changed?

  • New system.retryUnboundedOnSystemResourceExhausted global dynamic config (default false). When true, frontend calls to history/matching retry past the 2-attempt cap on system-scoped ResourceExhausted, bounded only by the policy's 1-minute expiration and the caller's context.
  • New backoff.ConditionalRetryPolicy that delegates to one of two policies based on a per-attempt predicate.

Why?

Today's 2-attempt cap surfaces transient capacity events as user-visible failures. We can hide transient system level resource exhausted errors.

@prathyushpv prathyushpv marked this pull request as ready for review May 27, 2026 18:52
@prathyushpv prathyushpv requested review from a team as code owners May 27, 2026 18:52
Add a new dynamic config `system.retryUnboundedOnSystemResourceExhausted`
(default false). When enabled, inter-service calls from frontend and CHASM
clients to history and matching retry past the standard 2-attempt cap on
system-scoped ResourceExhausted errors, bounded only by the retry policy's
default 1-minute expiration and the caller's context.

Implemented via a new backoff.ConditionalRetryPolicy that delegates to one
of two underlying policies based on a per-attempt predicate over the error.

Also fix an existing asymmetry in ThrottleRetryContext: a bare type assertion
caused wrapped ResourceExhausted errors (e.g., MultiOperationExecution) to
bypass the throttle-floor policy. Switched to errors.As so wrapped REs back
off consistently with unwrapped ones.

Namespace-scoped ResourceExhausted, non-RE transient errors, and matching
long-poll calls are unaffected.
…austed

The earlier swap from bare type assertion to errors.As was speculative
forward-compat with no actual caller benefit: MultiOperationExecution
(the only multi-error type in temporal-server) doesn't implement Unwrap(),
and nothing else wraps ResourceExhausted with %w. Revert to plain type
assertions to match the existing IsServiceClientTransientError style and
keep the diff focused on the dynamic-config gate.
The predicate uses a plain type assertion, so wrapped errors aren't a
concern in current callers. Remove the test and unused fmt import.
@prathyushpv prathyushpv force-pushed the ppv/retry-system-resource-exhausted branch from 7e0c0d3 to 4a3c575 Compare May 27, 2026 21:41
Copy link
Copy Markdown
Member

@yycptt yycptt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I was think of doing it in the frontend retry interceptor and only one place need to be changed. I don't have strong opinion here and your approach works as well. Just want to have this discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants