Fix stateless HTTP task accumulation causing memory leak#2145
Open
wiggzz wants to merge 12 commits intomodelcontextprotocol:mainfrom
Open
Fix stateless HTTP task accumulation causing memory leak#2145wiggzz wants to merge 12 commits intomodelcontextprotocol:mainfrom
wiggzz wants to merge 12 commits intomodelcontextprotocol:mainfrom
Conversation
Author
|
This isn't ready for review yet, as this is just a first pass from Claude. Currently testing this. |
b63de94 to
67d02d3
Compare
…tprotocol#756) In stateless mode, each request spawned a `run_stateless_server` task into the manager's global `_task_group`. After `handle_request()` completed and `terminate()` was called, the task continued running inside `app.run()`, blocked on `async for message in session.incoming_messages`. These zombie tasks accumulated indefinitely, leaking memory. Replace the global task group spawn with a request-scoped task group so that server tasks are automatically cancelled when their request completes. Add a regression test that simulates the real blocking behavior of `app.run()` using `anyio.sleep_forever()` and verifies no tasks linger in the global task group after requests finish. Closes modelcontextprotocol#756 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
67d02d3 to
a7a340d
Compare
…ntextprotocol#1764 (open) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add anyio.sleep(0) between terminate() and cancel_scope.cancel() so the message router can observe the ClosedResourceError from the closed streams before the task group is cancelled (restores line 1022 coverage) - Remove handle_list_tools from test — it was never called (only call_tool is exercised), so its return was an uncovered line - Mark the CallToolResult return as pragma: no cover — intentionally unreachable in this test because the task is always cancelled at await tool_gate.wait() before the return executes - Collapse multi-line function signatures to satisfy ruff-format Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sleep(0) yield in the stateless request handler allows the event loop to exercise exception paths that were previously unreachable with mocked tests. These paths are legitimately reachable and the more realistic test now covers them consistently: - streamable_http_manager.py: except Exception in run_stateless_server - streamable_http.py: except Exception in SSE pipeline (x2) - streamable_http.py: if not self.mcp_session_id (always true in stateless mode) - session.py: except Exception in receive loop cleanup Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sleep(0) between terminate() and cancel_scope.cancel() was non-deterministic: sometimes the event loop ran exception handlers (covering the pragmas), sometimes the cancel won the race (leaving lines uncovered). This caused strict-no-cover to flip between "wrongly marked" and "missing coverage" across CI runs. Revert to the simple, deterministic approach: - Restore all original pragma: no cover annotations - Add pragma: no cover for the ClosedResourceError/_terminated path in the message router, which is unreachable with request-scoped task groups (Cancelled always wins over ClosedResourceError) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The if-branch check itself is reachable; it's the body that isn't. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cancel the request-scoped task group first so the Cancelled exception deterministically reaches the server task before terminate() closes the streams. This avoids a race between Cancelled and ClosedResourceError in the message router. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change 5 pragma annotations from strict "no cover" to "lax no cover" since these exception handlers may or may not fire depending on cancellation timing. Fix ruff-format in test file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mark 4 test assertions as lax no cover — coverage.py on Python 3.11 falsely reports them as uncovered despite the test passing. Also fix ruff format for assert message and async with blocks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add None check assertion and type: ignore for accessing private _tasks attribute on TaskGroup (needed to verify no leaked tasks). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
run_stateless_servertasks accumulate indefinitely in the manager's global task group after client disconnects_task_groupfor stateless requests, ensuring tasks are cancelled when requests completeMotivation and Context
Related to #1764 (the underlying task accumulation issue in stateless mode).
Root cause
When a client disconnects while a tool call is in progress,
EventSourceResponsedetects the disconnect and tears down the SSE stream pipeline. However,Server.run()— which is processing the tool call in a separate task spawned into the manager's global_task_group— continues running. After the tool handler completes, the response has nowhere to go (the request streams were cleaned up), andServer.run()blocks indefinitely onasync for message in session.incoming_messages. These zombie tasks accumulate one per disconnected request.In production, we observed 449 leaked
ServerSessionobjects, 66,086 orphaned coroutines, 4,463 blocked stream receive operations, and ~150 MB of leaked heap — all from this pattern.Why the existing test didn't catch this
The existing test (
test_stateless_requests_memory_cleanup, added in PR #1140) mocksapp.runto return immediately, so the spawned tasks complete naturally. It never exercises the real-world scenario where client disconnects leaveapp.run()blocked.The fix
Replace the global task group spawn with a request-scoped
anyio.create_task_group(). Both the server task and the request handler run inside this scoped group. When the request handler finishes (or is cancelled due to client disconnect), the scope is cancelled, which automatically cancels the server task. No tasks leak into the manager's long-lived task group.Testing
The new test (
test_stateless_requests_task_leak_on_client_disconnect) uses:Server.run()with a real tool handler (no mock)httpx.ASGITransportAll 12 tests in the test file pass.
Breaking changes
None. This is an internal implementation change. The public API and behavior are unchanged.
Checklist
🤖 Generated with Claude Code