tui: persistent toolset-failure notifications stack without dedup, masking the screen on long MCP failure streaks #2884

@aheritier

Summary

When a remote MCP toolset enters a persistent failure state (e.g. an SSE endpoint returning "toolset not started"), the TUI adds a new persistent warning notification on the right side of the screen on every conversation turn. After a long session the cards cover most of the viewport, and each one must be dismissed individually with [x].

This is distinct from #2861 / #2866: #2861 is the heap-leak root cause of the jetsam kill; this one is a UX bug that surfaces whenever MCP fails repeatedly. Both can bite the same session (mine did).

Reproduction

  1. Start a session with at least one remote MCP whose endpoint will keep failing on Tools() — e.g. an sse server that has accepted Initialize but later returns lifecycle.ErrNotStarted on every list, or a transport that flaps.
  2. Have a few normal conversation turns.
  3. Observe that each turn appends a new persistent card, dismissible only via its own [x], reading: "Some toolsets failed to initialize for agent '<name>'.\n\nDetails:\n\n- mcp(remote host=… transport=sse) list failed: toolset not started".

Observed across three of my real sessions (screenshots in the linked DM): 5–7 identical stacked cards, several with two repeated lines inside the same card, after a few hours of activity.

Root cause

Two cooperating gaps:

1. No once-per-streak guard on the "list failed" path (pkg/agent/agent.go:321):

ta, err := toolSet.Tools(ctx)
if err != nil {
    desc := tools.DescribeToolSet(toolSet)
    slog.WarnContext(ctx, "Toolset listing failed; skipping", ...)
    a.AddToolWarning(fmt.Sprintf("%s list failed: %v", desc, err))
    continue
}

The "start failed" path at agent.go:361 and runtime.go:1152 correctly gates emission through StartableToolSet.ShouldReportFailure() (which returns true exactly once per failure streak). The "list failed" path skips this guard, so every iteration of collectTools re-emits a fresh warning for the same underlying problem.

2. No dedup in the notification manager (pkg/tui/components/notification/notification.go:131):

case ShowMsg:
    id := nextID.Add(1)
    ...
    item := notificationItem{ID: id, Text: msg.Text, Type: notifType}
    n.items = append([]notificationItem{item}, n.items...)

ShowMsg is appended unconditionally; persistent items (TypeWarning/TypeError) never auto-expire (persistent() returns true, no tea.Tick scheduled), so identical cards stack until the user clicks each [x].

The combination produces N copies of the same warning per failing toolset per session.

Proposed fix

Either side fixes the symptom; both together fix it cleanly:

  • Agent side: route collectTools errors through the same once-per-streak guard the start path uses. StartableToolSet already tracks pendingWarning for Start() failures; extend it (or add a sibling pendingListWarning) so a repeated Tools() failure for an already-started-but-now-broken toolset only surfaces once until it recovers.
  • Notification side: in Manager.Update's ShowMsg case, if a persistent notification with identical Text is already present, drop the new one (or bump a counter "× N" appended to the existing card). Cheap, defensive, and useful for any future caller that emits duplicate warnings.
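
As a rough illustration of the notification-side idea, the sketch below drops a duplicate persistent notification and bumps a counter on the existing card instead. The item and manager types, the Count field, and the show and label helpers are simplified stand-ins invented for this example, not the real notification component's API.

```go
package main

import "fmt"

// item and manager are simplified, hypothetical stand-ins for the
// notification component's types. Count is an assumption that supports
// the "× N" variant of the proposal.
type item struct {
	ID         int
	Text       string
	Persistent bool
	Count      int
}

type manager struct {
	nextID int
	items  []item
}

// show adds a notification, unless a persistent one with identical text is
// already present; in that case it only increments the existing card's counter.
func (m *manager) show(text string, persistent bool) {
	if persistent {
		for i := range m.items {
			if m.items[i].Persistent && m.items[i].Text == text {
				m.items[i].Count++
				return
			}
		}
	}
	m.nextID++
	it := item{ID: m.nextID, Text: text, Persistent: persistent, Count: 1}
	// Newest first, mirroring the prepend in the existing ShowMsg case.
	m.items = append([]item{it}, m.items...)
}

// label renders the card text, appending "× N" once a card has repeats.
func (it item) label() string {
	if it.Count > 1 {
		return fmt.Sprintf("%s × %d", it.Text, it.Count)
	}
	return it.Text
}

func main() {
	m := &manager{}
	for i := 0; i < 3; i++ {
		m.show("mcp list failed: toolset not started", true)
	}
	fmt.Println(len(m.items), m.items[0].label())
	// prints: 1 mcp list failed: toolset not started × 3
}
```

Only persistent items are deduplicated here, on the assumption that transient ones expire on their own and a repeat may be worth showing again.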

I'd recommend doing both: the agent fix is the right primary, the notification fix is a safety net.

Repro environment

  • macOS, Apple silicon, 64 GB
  • docker-agent HEAD as of 2026-05-22
  • Multi-agent config with several remote MCPs (Notion SSE, an internal streamable MCP)
  • Sessions of several hours; warnings observed across multiple agents (mark_iv, root)


Labels: area/mcp (MCP protocol, MCP tool servers, integration), area/tui (For features/issues/fixes related to the TUI)
