Skip to content

fix: retry MCP callTool on any connection error, not just ErrSessionMissing#2215

Draft
dgageot wants to merge 1 commit intodocker:mainfrom
dgageot:fix-flaky-test
Draft

fix: retry MCP callTool on any connection error, not just ErrSessionMissing#2215
dgageot wants to merge 1 commit intodocker:mainfrom
dgageot:fix-flaky-test

Conversation

@dgageot
Copy link
Member

@dgageot dgageot commented Mar 22, 2026

When an MCP server restarts, callTool could receive transport errors (EOF, connection reset) instead of ErrSessionMissing depending on timing. Broaden the retry condition to trigger forceReconnectAndWait on any non-context-cancellation error, fixing a flaky TestRemoteReconnectAfterServerRestart.

Assisted-By: docker-agent

@dgageot dgageot requested a review from a team as a code owner March 22, 2026 18:11
Copy link

@docker-agent docker-agent bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🔴 CRITICAL

Found 2 issues in the retry logic changes:

  • 1 HIGH severity issue
  • 1 MEDIUM severity issue

// If the call failed with a connection or session error (e.g. the
// server restarted), trigger or wait for a reconnection and retry
// the call once. Context-cancellation errors are not retried.
if err != nil && !errors.Is(err, context.Canceled) && ctx.Err() == nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 HIGH: Retry logic now retries on ALL errors, not just connection/session errors

The retry condition was broadened from checking specifically for errors.Is(err, mcp.ErrSessionMissing) to checking for any error that is not context.Canceled. This means the code will now trigger reconnection for ALL errors from CallTool, not just session-related errors.

Impact: Application-level errors like authentication failures, invalid tool names, or validation errors will trigger expensive reconnect attempts that won't fix the underlying issue. This wastes resources and adds latency to error responses.

Recommendation: Consider checking for specific connection/transport error types (e.g., net.Error, io.EOF, syscall.ECONNRESET) instead of retrying on all errors. For example:

if err != nil && isConnectionError(err) && ctx.Err() == nil {
    // retry logic
}

func isConnectionError(err error) bool {
    if errors.Is(err, mcp.ErrSessionMissing) {
        return true
    }
    if errors.Is(err, io.EOF) {
        return true
    }
    var netErr net.Error
    if errors.As(err, &netErr) {
        return true
    }
    // Check for syscall errors like ECONNRESET
    return false
}

// reconnection and retry the call once.
if errors.Is(err, mcp.ErrSessionMissing) {
slog.Warn("MCP session missing, forcing reconnect and retrying", "tool", toolCall.Function.Name, "server", ts.logID)
// If the call failed with a connection or session error (e.g. the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM: Comment claims to retry only connection/session errors but code retries all non-canceled errors

The comment states "If the call failed with a connection or session error (e.g. the server restarted), trigger or wait for a reconnection" but the actual code checks if err != nil && !errors.Is(err, context.Canceled) && ctx.Err() == nil, which retries on ANY error.

Impact: This mismatch between comment and implementation can mislead future maintainers about the actual behavior of the code.

Recommendation: Update the comment to accurately reflect the implementation, or update the implementation to match the comment's intent. For example:

// If the call failed with any error (except context cancellation),
// trigger or wait for a reconnection and retry the call once.

Or implement selective error checking as suggested in the HIGH severity finding above.

@dgageot dgageot marked this pull request as draft March 22, 2026 18:52
…sionMissing

When an MCP server restarts, callTool may receive transport errors
(EOF, connection reset, broken pipe) instead of ErrSessionMissing
depending on timing. Add an isConnectionError helper that selectively
retries on known connection/session error types rather than retrying
on all errors indiscriminately.

The helper checks for:
- mcp.ErrSessionMissing (server lost our session)
- io.EOF (connection dropped)
- net.Error (timeout, connection reset, etc.)
- String-based fallback for transport errors the MCP SDK wraps
  with %v (losing the original error chain)

This fixes flaky TestRemoteReconnectAfterServerRestart without
broadening retry to application-level errors.

Assisted-By: docker-agent
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant