fix(oauth): narrow context.lock scope in async_auth_flow (rebase of #2660 by @peisuke, closes #2847) by Bartok9 · Pull Request #2858 · modelcontextprotocol/python-sdk

Bartok9 · 2026-06-13T14:14:22Z

Summary

This is a conflict-free rebase of #2660 by @peisuke onto current main, opened as a fresh, mergeable PR so it can land without waiting on the original draft to be un-conflicted. All credit for the fix goes to @peisuke — the four commits below preserve their original authorship; I only resolved the rebase against main.

Closes #2847. Addresses the same RuntimeError: The current task is not holding this lock reported in #2644 and #2847.

Why a new PR

#2660 has been a CONFLICTING draft since May 22 with no maintainer review. Issue #2847 (June 12) is a second, independent production repro of the exact bug, and the reporter explicitly asks for #2660 to be reviewed/merged. The conflict against main is small and isolated to the Phase-1 refresh block. Rather than let the fix keep sitting, this PR provides a ready-to-merge version.

If maintainers prefer to merge #2660 directly, or if @peisuke wants to push the rebase to their own branch, this PR can be closed — the goal is just to unblock the fix. @peisuke, happy to defer entirely to you.

The bug

OAuthContext.lock is an anyio.Lock, which records task identity at acquire() and enforces same-task release(). async_auth_flow holds this lock across yield points. When httpx drives the generator from a different task during concurrent OAuth connections (e.g. Notion's frequent token refresh behind a gateway), release() raises RuntimeError: The current task is not holding this lock.

The fix (@peisuke's approach)

Narrow the lock scope so no HTTP yield (including the long-poll GET SSE and the token-refresh round trips) runs while holding context.lock, plus a single-flight refresh_lock with a re-check under the lock. This keeps trio portability (unlike swapping to asyncio.Lock).

Commits (authored by @peisuke)

fix(oauth): narrow async_auth_flow lock scope to avoid blocking long-poll requests
test: add regression coverage for OAuthClientProvider concurrent requests
test: cover refresh failure paths and double-check refresh skip
fix(oauth): cover Python 3.10/3.11 partial-branch coverage gap

Verification

Rebased onto current main (cf110e3). 0 behind, 4 ahead, touches only src/mcp/client/auth/oauth2.py and tests/client/test_auth.py.

$ uv run pytest tests/client/test_auth.py -q
108 passed, 1 xfailed (0.26s)

$ uv run ruff check src/mcp/client/auth/oauth2.py tests/client/test_auth.py
All checks passed!

cc @peisuke @EnGassa

…poll requests Previously the entire `OAuthClientProvider.async_auth_flow` body ran under `self.context.lock`, including the `yield request` that hands the request off to httpx. For requests that complete quickly this is fine, but a GET SSE long-poll holds the lock for the full SSE lifetime — which means any concurrent POST (e.g. `tools/call`) is blocked waiting for the lock, producing a ~16s perceived stall on lazy MCP connections that use OAuth. This commit splits the single coarse lock into purpose-specific scopes: Phase 1 (context.lock): initialize state, capture protocol_version, and decide whether a refresh is needed. Short-held; no HTTP I/O. Phase 2 (refresh_lock, new): single-flight token refresh. The refresh request `yield` happens outside any lock. A double-check inside `context.lock` ensures concurrent waiters do not redundantly refresh after another coroutine completed one. Phase 3 (no lock): add the auth header and yield the actual request. GET SSE long-polls and other long-running requests no longer block unrelated traffic. Phase 4 (context.lock): 401 / 403 full OAuth re-auth path. Conservatively kept under lock because this path is rare and its yielded sub-requests (metadata discovery, registration, token exchange) hit the AS, not the resource server. A future refactor can narrow this further. Lock additions: - `OAuthContext.refresh_lock: anyio.Lock` provides single-flight refresh so concurrent requests trigger at most one token refresh. Behavior changes: - Concurrent requests through the same `OAuthClientProvider` no longer serialize at the lock. GET SSE long-polls and POSTs now proceed in parallel. - Token refresh remains serialized (via `refresh_lock`), preserving the invariant that only one refresh request is in flight at a time. - Public API and behavior are otherwise unchanged. Related upstream issue: modelcontextprotocol#1326

…ests Two tests in a new TestConcurrentRequestsDoNotDeadlock class exercise the behavior the previous commit fixes: 1. ``test_concurrent_request_not_blocked_by_pending_long_running_request`` drives one async_auth_flow generator to its yield (= simulating a GET SSE long-poll suspended waiting for the next event) and then opens a second concurrent flow on the same provider. The second flow must reach its own yield within a short timeout — i.e., the lock release between Phase 1 and Phase 3 lets it through. Pre-fix, the second generator would block on context.lock indefinitely. 2. ``test_concurrent_token_refresh_is_single_flight`` exercises the refresh_lock single-flight path. A first flow performs the refresh yield; a second flow started after the refresh completes observes the freshly-updated token in Phase 1 and proceeds directly to its own request yield without issuing a second refresh. Also: tighten the refresh_request unbound-after-conditional-write pattern in async_auth_flow so pyright recognizes it as definitely assigned at the yield site (was: derived from a boolean predicate; now: typed as ``httpx.Request | None`` and checked explicitly).

After dropping the function-level `# pragma: no cover` from `_handle_refresh_response` and removing the per-line pragmas from the refactored Phase 2, the strict-no-cover audit identified covered lines still marked pragma'd and surfaced previously-untested branches. Three new tests close the coverage gaps: * `test_refresh_with_failed_status_clears_tokens` — exercises the ``response.status_code != 200`` branch in `_handle_refresh_response` and the `self._initialized = False` reset on refresh failure. * `test_refresh_with_invalid_json_clears_tokens` — exercises the ValidationError branch when the refresh body is not valid JSON. * `test_double_check_inside_refresh_lock_skips_second_refresh` — uses monkeypatch to flip `is_token_valid` between Phase 1 (False) and the double-check inside `refresh_lock` (True), exercising the branch where a second coroutine's refresh is correctly elided. Also: convert the new tests from the legacy Test* class pattern to plain top-level `test_*` functions per AGENTS.md, and drop unneeded per-line `# pragma: no cover` markers in the refactored auth_flow. Coverage report: 100.00% on `src/mcp/client/auth/oauth2.py`, strict-no-cover clean.

On Python 3.10/3.11, coverage.py (sys.settrace backend) misattributes the False arm of compound boolean predicates inside an async with block in an async generator, leading to spurious partial-branch warnings on 4 lines in async_auth_flow: - line 536: `if not is_token_valid() and can_refresh_token():` (Phase 1) - line 546: same (Phase 2 double-check inside refresh_lock) - line 548: `if refresh_request is not None:` (re-check skip path) - line 553: `if not await _handle_refresh_response(...):` (refresh success path) Python 3.12+ (sys.monitoring) tracks these branches correctly. Add `# pragma: no branch` to the 4 lines as a workaround for the legacy backend only. An inline comment block above the Phase 1 if documents the rationale. Also add 2 unit tests that explicitly exercise the affected branches so the coverage drop is shown to be a tool-precision issue, not missing tests: - test_phase1_skips_refresh_when_token_valid: Phase 1 sees is_token_valid=True and skips Phase 2 entirely. - test_refresh_success_proceeds_to_phase3_without_resetting_initialized: _handle_refresh_response returns True (HTTP 200) so the _initialized reset is skipped and Phase 3 proceeds with the fresh token. Local verification (Python 3.10): pytest tests/client/test_auth.py -n auto + coverage report → 99.43% (was 98.30%, BrPart 1, Missing only line 206 which is outside the test_auth.py scope). Full suite should reach 100%.

…plicate authorization_code)

peisuke added 4 commits June 13, 2026 02:05

fix(auth): restart full flow exactly once after failed refresh (no du…

8ffed4a

…plicate authorization_code)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(oauth): narrow context.lock scope in async_auth_flow (rebase of #2660 by @peisuke, closes #2847)#2858

fix(oauth): narrow context.lock scope in async_auth_flow (rebase of #2660 by @peisuke, closes #2847)#2858
Bartok9 wants to merge 5 commits into
modelcontextprotocol:mainfrom
Bartok9:salvage/2660-oauth-lock-rebase

Bartok9 commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Bartok9 commented Jun 13, 2026

Summary

Why a new PR

The bug

The fix (@peisuke's approach)

Commits (authored by @peisuke)

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants