Guard NetworkTransport connect continuation against stale reconnectio… #178
+77
−13
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR fixes a crash in
NetworkTransport.connect()where a delayed reconnection task could resume a stale continuation after a new connect attempt started, causing a crash.Motivation and Context
The bug occurs when a delayed reconnection task from a previous connect attempt runs after a new connect attempt has started.
While using iMCP.app, I saw that app crash when a client (
imcp-server) started and then died quickly. Crash logs showedCheckedContinuation.resume(throwing:)fromNetworkTransport.handleReconnection. What was happening was that whenNetworkTransport.connect()is called, it uses a checked continuation and schedules reconnection work with backoff on failure/cancellation. If a reconnect attempt is scheduled and a newconnect()call happens before that delayed task fires, the old task can still resume the previous continuation. That double‑resumes aCheckedContinuationand traps (EXC_BREAKPOINT / SIGTRAP).How Has This Been Tested?
Trying this with the iMCP.app is the most easy way to repro the crash, and what I used to ensure that the fix removed the crash. You just run iMCP, then try running its
imcp-serverand CTRL-C it immmediately a few times. Eventually, this brings down the iMCP process as well.I have not produced a test which reproduces the bug in isolation, outside of iMCP, but I can do so if that would help. The sequence would work as follows
Repro (before fix):
NetworkTransport(default reconnection config).connect(), then cancel the underlying connection quickly enough to enterhandleReconnection.connect()again before the reconnection backoff delay elapses.EXC_BREAKPOINT / SIGTRAP).I found that
swift testfails on Swift 6.2.3 due to existing strict-concurrency errors inTests/MCPTests/ClientTests.swift.How this fix works
The continuation‑resume guard was a single boolean reset on each
connect()call. Delayed reconnection tasks from a previous connect attempt could still run after a new connect had reset the guard, allowing a second resume on the old continuation.This fix works as follows:
connectContinuationID).handleConnectionReady/Failed/CancelledandhandleReconnection.Breaking Changes
None.
Besides the fix, behavior is unchanged; this only prevents stale tasks from resolving an old continuation.
This is an internal-only change; the only observable difference is that stale reconnection tasks no longer resolve a previous
connect()attempt.Types of changes
Checklist
Additional context on use of AI
I (algal) have reviewed this PR, performed the before/after tests above, and I believe the analysis I have stated above is correct. I also have a background in Swift and of native development. However, I am not deeply familiar with this code base, and I did rely on AI in the developing this fix and drafting this PR text.
I know there's a lot of slop PRs flying around these days, so I wanted to be transparent about that. I am happy to be corrected or steered