fix: retry MCP callTool on any connection error, not just ErrSessionMissing#2215
fix: retry MCP callTool on any connection error, not just ErrSessionMissing#2215dgageot wants to merge 1 commit intodocker:mainfrom
Conversation
pkg/tools/mcp/mcp.go
Outdated
| // If the call failed with a connection or session error (e.g. the | ||
| // server restarted), trigger or wait for a reconnection and retry | ||
| // the call once. Context-cancellation errors are not retried. | ||
| if err != nil && !errors.Is(err, context.Canceled) && ctx.Err() == nil { |
There was a problem hiding this comment.
🔴 HIGH: Retry logic now retries on ALL errors, not just connection/session errors
The retry condition was broadened from checking specifically for errors.Is(err, mcp.ErrSessionMissing) to checking for any error that is not context.Canceled. This means the code will now trigger reconnection for ALL errors from CallTool, not just session-related errors.
Impact: Application-level errors like authentication failures, invalid tool names, or validation errors will trigger expensive reconnect attempts that won't fix the underlying issue. This wastes resources and adds latency to error responses.
Recommendation: Consider checking for specific connection/transport error types (e.g., net.Error, io.EOF, syscall.ECONNRESET) instead of retrying on all errors. For example:
if err != nil && isConnectionError(err) && ctx.Err() == nil {
// retry logic
}
func isConnectionError(err error) bool {
if errors.Is(err, mcp.ErrSessionMissing) {
return true
}
if errors.Is(err, io.EOF) {
return true
}
var netErr net.Error
if errors.As(err, &netErr) {
return true
}
// Check for syscall errors like ECONNRESET
return false
}| // reconnection and retry the call once. | ||
| if errors.Is(err, mcp.ErrSessionMissing) { | ||
| slog.Warn("MCP session missing, forcing reconnect and retrying", "tool", toolCall.Function.Name, "server", ts.logID) | ||
| // If the call failed with a connection or session error (e.g. the |
There was a problem hiding this comment.
🟡 MEDIUM: Comment claims to retry only connection/session errors but code retries all non-canceled errors
The comment states "If the call failed with a connection or session error (e.g. the server restarted), trigger or wait for a reconnection" but the actual code checks if err != nil && !errors.Is(err, context.Canceled) && ctx.Err() == nil, which retries on ANY error.
Impact: This mismatch between comment and implementation can mislead future maintainers about the actual behavior of the code.
Recommendation: Update the comment to accurately reflect the implementation, or update the implementation to match the comment's intent. For example:
// If the call failed with any error (except context cancellation),
// trigger or wait for a reconnection and retry the call once.Or implement selective error checking as suggested in the HIGH severity finding above.
…sionMissing When an MCP server restarts, callTool may receive transport errors (EOF, connection reset, broken pipe) instead of ErrSessionMissing depending on timing. Add an isConnectionError helper that selectively retries on known connection/session error types rather than retrying on all errors indiscriminately. The helper checks for: - mcp.ErrSessionMissing (server lost our session) - io.EOF (connection dropped) - net.Error (timeout, connection reset, etc.) - String-based fallback for transport errors the MCP SDK wraps with %v (losing the original error chain) This fixes flaky TestRemoteReconnectAfterServerRestart without broadening retry to application-level errors. Assisted-By: docker-agent
When an MCP server restarts, callTool could receive transport errors (EOF, connection reset) instead of ErrSessionMissing depending on timing. Broaden the retry condition to trigger forceReconnectAndWait on any non-context-cancellation error, fixing a flaky TestRemoteReconnectAfterServerRestart.
Assisted-By: docker-agent