Skip to content

Fix stateless HTTP task accumulation causing memory leak#2145

Open
wiggzz wants to merge 12 commits intomodelcontextprotocol:mainfrom
wiggzz:fix/stateless-task-group-leak
Open

Fix stateless HTTP task accumulation causing memory leak#2145
wiggzz wants to merge 12 commits intomodelcontextprotocol:mainfrom
wiggzz:fix/stateless-task-group-leak

Conversation

@wiggzz
Copy link

@wiggzz wiggzz commented Feb 25, 2026

Summary

  • Fix memory leak where run_stateless_server tasks accumulate indefinitely in the manager's global task group after client disconnects
  • Use request-scoped task groups instead of the global _task_group for stateless requests, ensuring tasks are cancelled when requests complete
  • Add regression test that reproduces the real-world leak scenario with zero mocks in the core flow

Motivation and Context

Related to #1764 (the underlying task accumulation issue in stateless mode).

Root cause

When a client disconnects while a tool call is in progress, EventSourceResponse detects the disconnect and tears down the SSE stream pipeline. However, Server.run() — which is processing the tool call in a separate task spawned into the manager's global _task_group — continues running. After the tool handler completes, the response has nowhere to go (the request streams were cleaned up), and Server.run() blocks indefinitely on async for message in session.incoming_messages. These zombie tasks accumulate one per disconnected request.

In production, we observed 449 leaked ServerSession objects, 66,086 orphaned coroutines, 4,463 blocked stream receive operations, and ~150 MB of leaked heap — all from this pattern.

Why the existing test didn't catch this

The existing test (test_stateless_requests_memory_cleanup, added in PR #1140) mocks app.run to return immediately, so the spawned tasks complete naturally. It never exercises the real-world scenario where client disconnects leave app.run() blocked.

The fix

Replace the global task group spawn with a request-scoped anyio.create_task_group(). Both the server task and the request handler run inside this scoped group. When the request handler finishes (or is cancelled due to client disconnect), the scope is cancelled, which automatically cancels the server task. No tasks leak into the manager's long-lived task group.

Testing

The new test (test_stateless_requests_task_leak_on_client_disconnect) uses:

  • Real Server.run() with a real tool handler (no mock)
  • Real SSE streaming via httpx.ASGITransport
  • Simulated client disconnect by cancelling the request task while the tool is executing
  • Verifies that after 3 disconnected requests, zero tasks remain in the global task group

All 12 tests in the test file pass.

Breaking changes

None. This is an internal implementation change. The public API and behavior are unchanged.

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

🤖 Generated with Claude Code

@wiggzz
Copy link
Author

wiggzz commented Feb 25, 2026

This isn't ready for review yet, as this is just a first pass from Claude. Currently testing this.

@wiggzz wiggzz force-pushed the fix/stateless-task-group-leak branch from b63de94 to 67d02d3 Compare February 25, 2026 16:53
…tprotocol#756)

In stateless mode, each request spawned a `run_stateless_server` task
into the manager's global `_task_group`. After `handle_request()`
completed and `terminate()` was called, the task continued running
inside `app.run()`, blocked on `async for message in
session.incoming_messages`. These zombie tasks accumulated indefinitely,
leaking memory.

Replace the global task group spawn with a request-scoped task group
so that server tasks are automatically cancelled when their request
completes.

Add a regression test that simulates the real blocking behavior of
`app.run()` using `anyio.sleep_forever()` and verifies no tasks
linger in the global task group after requests finish.

Closes modelcontextprotocol#756

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wiggzz wiggzz force-pushed the fix/stateless-task-group-leak branch from 67d02d3 to a7a340d Compare February 25, 2026 17:00
wiggzz and others added 11 commits February 25, 2026 11:07
…ntextprotocol#1764 (open)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add anyio.sleep(0) between terminate() and cancel_scope.cancel() so
  the message router can observe the ClosedResourceError from the closed
  streams before the task group is cancelled (restores line 1022 coverage)
- Remove handle_list_tools from test — it was never called (only
  call_tool is exercised), so its return was an uncovered line
- Mark the CallToolResult return as pragma: no cover — intentionally
  unreachable in this test because the task is always cancelled at
  await tool_gate.wait() before the return executes
- Collapse multi-line function signatures to satisfy ruff-format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sleep(0) yield in the stateless request handler allows the event
loop to exercise exception paths that were previously unreachable with
mocked tests. These paths are legitimately reachable and the more
realistic test now covers them consistently:

- streamable_http_manager.py: except Exception in run_stateless_server
- streamable_http.py: except Exception in SSE pipeline (x2)
- streamable_http.py: if not self.mcp_session_id (always true in stateless mode)
- session.py: except Exception in receive loop cleanup

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The sleep(0) between terminate() and cancel_scope.cancel() was
non-deterministic: sometimes the event loop ran exception handlers
(covering the pragmas), sometimes the cancel won the race (leaving
lines uncovered). This caused strict-no-cover to flip between
"wrongly marked" and "missing coverage" across CI runs.

Revert to the simple, deterministic approach:
- Restore all original pragma: no cover annotations
- Add pragma: no cover for the ClosedResourceError/_terminated path
  in the message router, which is unreachable with request-scoped
  task groups (Cancelled always wins over ClosedResourceError)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The if-branch check itself is reachable; it's the body that isn't.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cancel the request-scoped task group first so the Cancelled exception
deterministically reaches the server task before terminate() closes the
streams. This avoids a race between Cancelled and ClosedResourceError
in the message router.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change 5 pragma annotations from strict "no cover" to "lax no cover"
since these exception handlers may or may not fire depending on
cancellation timing. Fix ruff-format in test file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mark 4 test assertions as lax no cover — coverage.py on Python 3.11
falsely reports them as uncovered despite the test passing. Also fix
ruff format for assert message and async with blocks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add None check assertion and type: ignore for accessing private
_tasks attribute on TaskGroup (needed to verify no leaked tasks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant