@algal commented Jan 19, 2026

This PR fixes a crash in NetworkTransport.connect() where a delayed reconnection task could resume a stale continuation after a new connect attempt had started.

Motivation and Context

The bug occurs when a delayed reconnection task from a previous connect attempt runs after a new connect attempt has started.

While using iMCP.app, I saw the app crash when a client (imcp-server) started and then died quickly. Crash logs pointed at CheckedContinuation.resume(throwing:) called from NetworkTransport.handleReconnection. Here is what was happening: NetworkTransport.connect() suspends on a checked continuation and, on failure or cancellation, schedules reconnection work with backoff. If a reconnect attempt is scheduled and a new connect() call arrives before that delayed task fires, the old task can still resume the previous continuation. Resuming a CheckedContinuation twice traps (EXC_BREAKPOINT / SIGTRAP).
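For reference, the trap itself is easy to demonstrate outside the SDK; this minimal snippet (not SDK code) dies the same way:

// Not SDK code: a CheckedContinuation traps with a fatal error
// if it is resumed more than once.
func doubleResume() async -> Int {
    await withCheckedContinuation { continuation in
        continuation.resume(returning: 1)
        continuation.resume(returning: 2)  // EXC_BREAKPOINT / SIGTRAP here
    }
}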

How Has This Been Tested?

Running iMCP.app is the easiest way to reproduce the crash, and it is what I used to verify that the fix removes it. You just run iMCP, then start its imcp-server and CTRL-C it immediately, a few times in a row. Eventually this brings down the iMCP process as well.

I have not produced a test which reproduces the bug in isolation, outside of iMCP, but I can do so if that would help. The sequence would work as follows (a sketch of such a test appears after the steps):

Repro (before fix):

  1. Create a NetworkTransport (default reconnection config).
  2. Call connect(), then cancel the underlying connection quickly enough to enter handleReconnection.
  3. Call connect() again before the reconnection backoff delay elapses.
  4. When the delayed task fires, it resumes the old continuation and traps (EXC_BREAKPOINT / SIGTRAP).
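A rough sketch of that test, assuming NetworkTransport is constructed from an NWConnection and that pointing it at a closed local port fails fast enough to enter handleReconnection (exact API and timing may differ):

import Network
import XCTest

final class ReconnectionRaceTests: XCTestCase {
    func testStaleReconnectTaskDoesNotDoubleResume() async throws {
        // No listener on this port, so the connect attempt fails quickly.
        let connection = NWConnection(host: "127.0.0.1", port: 59999, using: .tcp)
        let transport = NetworkTransport(connection: connection)

        // First attempt: fail/cancel fast, so handleReconnection schedules
        // a delayed retry with backoff.
        let first = Task { try await transport.connect() }
        try await Task.sleep(for: .milliseconds(200))
        first.cancel()

        // Second attempt before the backoff delay elapses.
        let second = Task { try await transport.connect() }

        // Before the fix, the stale delayed task fires here and resumes the
        // old continuation a second time, trapping the process; after the
        // fix it is ignored and the test finishes cleanly.
        try await Task.sleep(for: .seconds(2))
        second.cancel()
    }
}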

I found that swift test fails on Swift 6.2.3 due to existing strict-concurrency errors in Tests/MCPTests/ClientTests.swift.

How this fix works

Previously, the continuation-resume guard was a single boolean that was reset on each connect() call. Delayed reconnection tasks from a previous connect attempt could still run after a new connect had reset the guard, allowing a second resume on the old continuation.

This fix works as follows (a simplified sketch follows the list):

  • Introduce a per‑connect UUID (connectContinuationID).
  • Thread that ID through handleConnectionReady/Failed/Cancelled and handleReconnection.
  • Resume only if the ID matches the current connect attempt, and invalidate it after resuming.
  • Ignore delayed reconnection tasks from stale attempts.
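In simplified form, the guard looks like this. This is a sketch with stand-in types, not the actual patch; only connectContinuationID and the handler names come from the change itself:

import Foundation

// Simplified stand-in for the guarded connect path.
actor ConnectGuard {
    private var connectContinuationID: UUID?
    private var connectContinuation: CheckedContinuation<Void, Error>?

    func connect(start: (UUID) -> Void) async throws {
        try await withCheckedThrowingContinuation { continuation in
            let id = UUID()
            connectContinuationID = id   // a new attempt invalidates older IDs
            connectContinuation = continuation
            start(id)                    // state handlers carry this ID back
        }
    }

    // Called by handleConnectionReady/Failed/Cancelled and handleReconnection.
    func resumeConnect(_ result: Result<Void, Error>, attemptID: UUID) {
        // Resume only for the current attempt; stale delayed tasks bail out here.
        guard attemptID == connectContinuationID,
              let continuation = connectContinuation else { return }
        connectContinuationID = nil      // invalidate after resuming
        connectContinuation = nil
        continuation.resume(with: result)
    }
}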

Breaking Changes

None. This is an internal-only change; apart from the fix, behavior is unchanged. The only observable difference is that stale reconnection tasks no longer resolve a previous connect() attempt.

Types of changes

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context on use of AI

I (algal) have reviewed this PR, performed the before/after tests above, and I believe the analysis stated above is correct. I also have a background in Swift and native development. However, I am not deeply familiar with this code base, and I relied on AI in developing this fix and drafting this PR text.

I know there are a lot of slop PRs flying around these days, so I wanted to be transparent about that. I am happy to be corrected or steered.

@DePasqualeOrg

This issue, along with many others, has already been resolved in my fork, which @movetz, @stallent, and I are discussing merging into this repository.

Instead of using a checked continuation with callback-based state handling (which requires the UUID-based guard proposed in this PR), my fork uses AsyncStream to bridge NWConnection state updates:

private func waitForConnectionReady() async throws {
    let stateStream = AsyncStream<NWConnection.State> { continuation in
        connection.stateUpdateHandler = { state in
            continuation.yield(state)
            switch state {
            case .ready, .failed, .cancelled:
                continuation.finish()
            default:
                break
            }
        }
    }

    for await state in stateStream {
        switch state {
        case .ready: return
        case .failed(let error): throw error
        case .cancelled: throw MCPError.internalError("Connection cancelled")
        default: continue  // non-terminal states: keep waiting
        }
    }
}

This approach inherently avoids the double-resume race condition because:

  • A fresh stream is created for each connect() call.
  • The stream finishes on terminal states, leaving no lingering continuation.
  • My fork removes in-connect reconnection scheduling entirely — on failure, it throws immediately rather than scheduling a delayed retry. Reconnection is handled in receiveLoop() for post-connect failures.

@algal (Author) commented Jan 20, 2026

@DePasqualeOrg Cool! It sounds like your solution is better. I hope it is merged. :)

@DePasqualeOrg commented Jan 20, 2026

It won't be, unless it's replicated by someone else, because the new maintainers of this package want to take a patchwork approach to improvements. My full-featured fork will continue to be developed separately here: DePasqualeOrg/swift-mcp
