feat(auth): proactive OAuth token refresh with jitter to reduce concurrent refresh spikes #2859

Open
Bartok9 wants to merge 2 commits into modelcontextprotocol:main from Bartok9:feat/oauth-proactive-refresh-jitter

Bartok9 commented Jun 13, 2026

Summary

Refresh OAuth tokens proactively at ~80% of their lifetime with a small random jitter, instead of only reactively once they've already expired. This reduces the "thundering herd" of simultaneous token refreshes that occurs when a fleet of OAuth-backed MCP connectors is provisioned around the same time.

The production problem

When many MCP clients each hold an OAuth connection and were provisioned (or last refreshed) in roughly the same window, their access tokens all expire inside the same narrow window too. Today refresh only fires after is_token_valid() returns False (i.e. after hard expiry), so all of those clients try to refresh at nearly the same moment.

For a large fleet, that produces a synchronized burst of grant_type=refresh_token requests against the authorization server, causing contention, rate-limit (429) responses, and spurious auth failures, all clustered into the same ~60s window. The herd then re-synchronizes on the new tokens and the spike repeats on the next cycle.

The design

Add a per-connection proactive refresh point that sits before hard expiry and is individually jittered, so a fleet desynchronizes naturally:

refresh_at = now + expires_in * refresh_fraction - jitter
  • refresh_fraction = 0.8 by default → refresh once 80% of the lifetime has elapsed, leaving headroom before hard expiry.
  • jitter ∈ [0, 30s] by default, always subtracted so it can only pull the refresh point earlier — it can never push past hard expiry. Each connector draws its own jitter, so refreshes spread out across the window rather than bunching up.
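
To make the arithmetic concrete, here is a minimal sketch of the formula above applied to a one-hour token. The helper name `refresh_at`, the explicit `now` argument, and the seeded RNG are illustrative assumptions and not part of this PR; clamping is deliberately left out here.

```python
import random


def refresh_at(
    now: float,
    expires_in: float,
    refresh_fraction: float = 0.8,
    jitter: float = 0.0,
) -> float:
    """Proactive refresh point per the formula above (hypothetical helper, no clamping)."""
    return now + expires_in * refresh_fraction - jitter


now = 1_000.0        # arbitrary timestamp in seconds
expires_in = 3600.0  # a one-hour access token

# Without jitter every connector lands on the same instant: now + 2880s.
unjittered = refresh_at(now, expires_in)

# With one draw from [0, 30s] per connector, a fleet spreads over a 30s
# window that ends at the 80% mark and never extends past it.
rng = random.Random(0)  # seeded only so the sketch is reproducible
fleet = [
    refresh_at(now, expires_in, jitter=rng.uniform(0.0, 30.0))
    for _ in range(1000)
]
```

Because the jitter is only ever subtracted, the latest possible refresh in this sketch is the unjittered 80% mark, which still leaves 20% of the lifetime as headroom.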

New pieces:

  1. calculate_token_refresh_time(expires_in, *, refresh_fraction=0.8, max_jitter_seconds=30.0, jitter=None) in src/mcp/shared/auth_utils.py: pure and deterministically testable (inject jitter to bypass the RNG); returns None when expires_in is None.
  2. OAuthContext.token_refresh_time field, set alongside token_expiry_time in update_token_expiry and cleared in clear_tokens.
  3. OAuthContext.should_refresh_token() → True when we hold refreshable tokens and we're past the jittered proactive-refresh point, even if the token is still technically valid.
  4. async_auth_flow Phase 1 now refreshes when the token is hard-invalid OR should_refresh_token() is True (while can_refresh_token()), keeping the existing re-check / lock structure intact.

is_token_valid() is deliberately unchanged — it still gates whether a token is usable at all (hard validity). Proactive refresh is layered on top.
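
The layering of the two checks can be sketched roughly as follows. `TokenState`, `needs_refresh`, the `has_refresh_token` flag (standing in for `can_refresh_token()`), and the explicit `now` argument are all hypothetical simplifications for illustration; the real `OAuthContext` carries more state and may differ in detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenState:
    """Hypothetical stand-in for the token-related fields of OAuthContext."""

    has_refresh_token: bool
    token_expiry_time: Optional[float]   # hard expiry (absolute timestamp)
    token_refresh_time: Optional[float]  # jittered proactive-refresh point

    def is_token_valid(self, now: float) -> bool:
        # Hard validity only: usable until the expiry timestamp passes.
        return self.token_expiry_time is None or now <= self.token_expiry_time

    def should_refresh_token(self, now: float) -> bool:
        # Proactive check: only meaningful if we can refresh at all and a
        # refresh point was computed (i.e. the server sent expires_in).
        if not self.has_refresh_token or self.token_refresh_time is None:
            return False
        return now >= self.token_refresh_time


def needs_refresh(state: TokenState, now: float) -> bool:
    """Phase 1 decision: hard-invalid OR past the proactive point, if refreshable."""
    return state.has_refresh_token and (
        not state.is_token_valid(now) or state.should_refresh_token(now)
    )
```

In this sketch a token with hard expiry at t=100 and refresh point at t=80 is used directly at t=50, triggers a proactive refresh at t=85 while still valid, and falls back to the reactive path at t=120.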

Edge cases handled

  • expires_in is None → token_refresh_time is None; should_refresh_token() returns False and behavior degrades to the existing reactive path.
  • Tiny TTLs (e.g. expires_in smaller than max_jitter_seconds): jitter is clamped to the available (refresh_at - now) window so the result never goes negative or before now.
  • Never past hard expiry: the result is always clamped into (now, hard_expiry].
  • String expires_in (some servers return it as a string): handled via int() like calculate_token_expiry.
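
One way the edge cases above could fit together is sketched below. This is not the code in the PR: the body is a reconstruction from the description, the `now` keyword is a hypothetical addition so the clock can be injected, and the closed lower bound at `now` is an assumption made so that a zero TTL collapses to `now`.

```python
import random
import time
from typing import Optional, Union


def calculate_token_refresh_time(
    expires_in: Union[int, str, None],
    *,
    refresh_fraction: float = 0.8,
    max_jitter_seconds: float = 30.0,
    jitter: Optional[float] = None,
    now: Optional[float] = None,  # hypothetical: injected clock for tests
) -> Optional[float]:
    """Sketch: absolute time at which to proactively refresh, or None."""
    if expires_in is None:
        return None  # no TTL known: caller falls back to reactive refresh

    ttl = max(0, int(expires_in))  # some servers send expires_in as a string
    if now is None:
        now = time.time()

    hard_expiry = now + ttl
    target = now + ttl * refresh_fraction

    if jitter is None:
        jitter = random.uniform(0.0, max_jitter_seconds)
    # Clamp jitter to the available window so tiny TTLs never go before now.
    jitter = min(max(jitter, 0.0), target - now)

    # Final guard: never earlier than now, never later than hard expiry.
    return min(max(target - jitter, now), hard_expiry)
```

For example, with `now=1000.0`, `calculate_token_refresh_time(3600, jitter=10.0, now=1000.0)` gives 3870.0, while a 10-second TTL with a 30-second injected jitter is clamped back to 1000.0 rather than landing in the past.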

Backward compatibility

Fully backward compatible. No existing public signatures change, and the defaults keep the same overall behavior; the only observable difference is that a refresh may fire earlier than hard expiry. Clients that never got an expires_in keep the old reactive behavior exactly.

Test coverage

  • tests/shared/test_auth_utils.py: 9 new tests for calculate_token_refresh_time, covering normal TTL (within the jitter window and strictly before hard expiry), None → None, string expires_in, deterministic injected jitter, jitter ordering (more jitter → earlier), never-past-hard-expiry across many TTLs, tiny-TTL no-negative, zero-TTL collapse, and custom fraction.
  • tests/client/test_auth.py: the should_refresh_token() predicate (hard-valid-but-past-window → True; fresh → False; no refresh time → False; no refresh token → False), plus two async_auth_flow integration tests: one proving a proactive refresh request is yielded while the token is still hard-valid, and one proving a fresh token is used directly with no refresh.

All tests/client/test_auth.py + tests/shared/test_auth_utils.py pass (128 passed, 1 xfailed). uv run ruff check, uv run ruff format --check, and uv run pyright are clean on all touched files.

Relationship to #2858

This is complementary to and independent of #2858. That PR addresses concurrency/locking of refresh (narrowing the anyio.Lock scope + single-flight refresh_lock to fix a RuntimeError). This PR is purely about when a refresh fires (proactive + jittered), not how it's locked. They touch different concerns and compose cleanly; if #2858 lands first this will need only a trivial rebase.

Credit

Motivated by production feedback from @Ben-Home (CorpusIQ) on #2847. Refs #2847, #2858.
