Skip to content

fix(oauth): narrow context.lock scope in async_auth_flow (rebase of #2660 by @peisuke, closes #2847)#2858

Open
Bartok9 wants to merge 5 commits into
modelcontextprotocol:mainfrom
Bartok9:salvage/2660-oauth-lock-rebase
Open

fix(oauth): narrow context.lock scope in async_auth_flow (rebase of #2660 by @peisuke, closes #2847)#2858
Bartok9 wants to merge 5 commits into
modelcontextprotocol:mainfrom
Bartok9:salvage/2660-oauth-lock-rebase

Conversation

@Bartok9

@Bartok9 Bartok9 commented Jun 13, 2026

Copy link
Copy Markdown

Summary

This is a conflict-free rebase of #2660 by @peisuke onto current main, opened as a fresh, mergeable PR so it can land without waiting on the original draft to be un-conflicted. All credit for the fix goes to @peisuke — the four commits below preserve their original authorship; I only resolved the rebase against main.

Closes #2847. Addresses the same RuntimeError: The current task is not holding this lock reported in #2644 and #2847.

Why a new PR

#2660 has been a CONFLICTING draft since May 22 with no maintainer review. Issue #2847 (June 12) is a second, independent production repro of the exact bug, and the reporter explicitly asks for #2660 to be reviewed/merged. The conflict against main is small and isolated to the Phase-1 refresh block. Rather than let the fix keep sitting, this PR provides a ready-to-merge version.

If maintainers prefer to merge #2660 directly, or if @peisuke wants to push the rebase to their own branch, this PR can be closed — the goal is just to unblock the fix. @peisuke, happy to defer entirely to you.

The bug

OAuthContext.lock is an anyio.Lock, which records task identity at acquire() and enforces same-task release(). async_auth_flow holds this lock across yield points. When httpx drives the generator from a different task during concurrent OAuth connections (e.g. Notion's frequent token refresh behind a gateway), release() raises RuntimeError: The current task is not holding this lock.

The fix (@peisuke's approach)

Narrow the lock scope so no HTTP yield (including the long-poll GET SSE and the token-refresh round trips) runs while holding context.lock, plus a single-flight refresh_lock with a re-check under the lock. This keeps trio portability (unlike swapping to asyncio.Lock).

Commits (authored by @peisuke)

  • fix(oauth): narrow async_auth_flow lock scope to avoid blocking long-poll requests
  • test: add regression coverage for OAuthClientProvider concurrent requests
  • test: cover refresh failure paths and double-check refresh skip
  • fix(oauth): cover Python 3.10/3.11 partial-branch coverage gap

Verification

Rebased onto current main (cf110e3). 0 behind, 4 ahead, touches only src/mcp/client/auth/oauth2.py and tests/client/test_auth.py.

$ uv run pytest tests/client/test_auth.py -q
108 passed, 1 xfailed (0.26s)

$ uv run ruff check src/mcp/client/auth/oauth2.py tests/client/test_auth.py
All checks passed!

cc @peisuke @EnGassa

peisuke added 4 commits June 13, 2026 02:05
…poll requests

Previously the entire `OAuthClientProvider.async_auth_flow` body ran under
`self.context.lock`, including the `yield request` that hands the request
off to httpx. For requests that complete quickly this is fine, but a GET
SSE long-poll holds the lock for the full SSE lifetime — which means any
concurrent POST (e.g. `tools/call`) is blocked waiting for the lock,
producing a ~16s perceived stall on lazy MCP connections that use OAuth.

This commit splits the single coarse lock into purpose-specific scopes:

  Phase 1 (context.lock): initialize state, capture protocol_version, and
    decide whether a refresh is needed. Short-held; no HTTP I/O.
  Phase 2 (refresh_lock, new): single-flight token refresh. The refresh
    request `yield` happens outside any lock. A double-check inside
    `context.lock` ensures concurrent waiters do not redundantly refresh
    after another coroutine completed one.
  Phase 3 (no lock): add the auth header and yield the actual request.
    GET SSE long-polls and other long-running requests no longer block
    unrelated traffic.
  Phase 4 (context.lock): 401 / 403 full OAuth re-auth path. Conservatively
    kept under lock because this path is rare and its yielded sub-requests
    (metadata discovery, registration, token exchange) hit the AS, not the
    resource server. A future refactor can narrow this further.

Lock additions:
  - `OAuthContext.refresh_lock: anyio.Lock` provides single-flight refresh
    so concurrent requests trigger at most one token refresh.

Behavior changes:
  - Concurrent requests through the same `OAuthClientProvider` no longer
    serialize at the lock. GET SSE long-polls and POSTs now proceed in
    parallel.
  - Token refresh remains serialized (via `refresh_lock`), preserving the
    invariant that only one refresh request is in flight at a time.
  - Public API and behavior are otherwise unchanged.

Related upstream issue:
  modelcontextprotocol#1326
…ests

Two tests in a new TestConcurrentRequestsDoNotDeadlock class exercise the
behavior the previous commit fixes:

1. ``test_concurrent_request_not_blocked_by_pending_long_running_request``
   drives one async_auth_flow generator to its yield (= simulating a
   GET SSE long-poll suspended waiting for the next event) and then opens
   a second concurrent flow on the same provider. The second flow must
   reach its own yield within a short timeout — i.e., the lock release
   between Phase 1 and Phase 3 lets it through. Pre-fix, the second
   generator would block on context.lock indefinitely.

2. ``test_concurrent_token_refresh_is_single_flight`` exercises the
   refresh_lock single-flight path. A first flow performs the refresh
   yield; a second flow started after the refresh completes observes the
   freshly-updated token in Phase 1 and proceeds directly to its own
   request yield without issuing a second refresh.

Also: tighten the refresh_request unbound-after-conditional-write pattern
in async_auth_flow so pyright recognizes it as definitely assigned at the
yield site (was: derived from a boolean predicate; now: typed as
``httpx.Request | None`` and checked explicitly).
After dropping the function-level `# pragma: no cover` from
`_handle_refresh_response` and removing the per-line pragmas from the
refactored Phase 2, the strict-no-cover audit identified covered lines
still marked pragma'd and surfaced previously-untested branches. Three
new tests close the coverage gaps:

* `test_refresh_with_failed_status_clears_tokens` — exercises the
  ``response.status_code != 200`` branch in `_handle_refresh_response`
  and the `self._initialized = False` reset on refresh failure.
* `test_refresh_with_invalid_json_clears_tokens` — exercises the
  ValidationError branch when the refresh body is not valid JSON.
* `test_double_check_inside_refresh_lock_skips_second_refresh` —
  uses monkeypatch to flip `is_token_valid` between Phase 1 (False)
  and the double-check inside `refresh_lock` (True), exercising the
  branch where a second coroutine's refresh is correctly elided.

Also: convert the new tests from the legacy Test* class pattern to
plain top-level `test_*` functions per AGENTS.md, and drop unneeded
per-line `# pragma: no cover` markers in the refactored auth_flow.

Coverage report: 100.00% on `src/mcp/client/auth/oauth2.py`,
strict-no-cover clean.
On Python 3.10/3.11, coverage.py (sys.settrace backend) misattributes
the False arm of compound boolean predicates inside an async with block
in an async generator, leading to spurious partial-branch warnings on
4 lines in async_auth_flow:
  - line 536: `if not is_token_valid() and can_refresh_token():`  (Phase 1)
  - line 546: same (Phase 2 double-check inside refresh_lock)
  - line 548: `if refresh_request is not None:`  (re-check skip path)
  - line 553: `if not await _handle_refresh_response(...):`  (refresh
    success path)

Python 3.12+ (sys.monitoring) tracks these branches correctly. Add
`# pragma: no branch` to the 4 lines as a workaround for the legacy
backend only. An inline comment block above the Phase 1 if documents
the rationale.

Also add 2 unit tests that explicitly exercise the affected branches so
the coverage drop is shown to be a tool-precision issue, not missing
tests:
  - test_phase1_skips_refresh_when_token_valid:
      Phase 1 sees is_token_valid=True and skips Phase 2 entirely.
  - test_refresh_success_proceeds_to_phase3_without_resetting_initialized:
      _handle_refresh_response returns True (HTTP 200) so the
      _initialized reset is skipped and Phase 3 proceeds with the
      fresh token.

Local verification (Python 3.10):
  pytest tests/client/test_auth.py -n auto + coverage report
  → 99.43% (was 98.30%, BrPart 1, Missing only line 206 which is
  outside the test_auth.py scope). Full suite should reach 100%.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: anyio.Lock in async_auth_flow causes RuntimeError under concurrent OAuth MCP connections

2 participants