tui: persistent toolset-failure notifications stack without dedup, masking the screen on long MCP failure streaks #2884

## Summary

When a remote MCP toolset enters a persistent failure state (e.g. an SSE endpoint returning `toolset not started`), the TUI accumulates a new persistent warning notification on the right side of the screen on every conversation turn. After a long session the cards cover most of the viewport — each must be dismissed individually with `[x]`.

This is **distinct** from #2861 / #2866: #2861 is the heap-leak root cause of the jetsam kill, whereas this is a UX bug that surfaces whenever an MCP toolset fails repeatedly. Both can hit the same session (mine did).

## Reproduction

1. Start a session with at least one remote MCP whose endpoint will keep failing on `Tools()` — e.g. an `sse` server that has accepted Initialize but later returns `lifecycle.ErrNotStarted` on every list, or a transport that flaps.
2. Have a few normal conversation turns.
3. Observe that each turn appends a new persistent `Some toolsets failed to initialize for agent '<name>'.\n\nDetails:\n\n- mcp(remote host=… transport=sse) list failed: toolset not started [x]` card.

Observed across three of my real sessions (screenshots in the linked DM): 5–7 identical stacked cards, several with two repeated lines inside the same card, after a few hours of activity.

## Root cause

Two cooperating gaps:

**1. No once-per-streak guard on the "list failed" path** — `pkg/agent/agent.go:321`:

```go
ta, err := toolSet.Tools(ctx)
if err != nil {
    desc := tools.DescribeToolSet(toolSet)
    slog.WarnContext(ctx, "Toolset listing failed; skipping", ...)
    a.AddToolWarning(fmt.Sprintf("%s list failed: %v", desc, err))
    continue
}
```

The "start failed" path at `agent.go:361` and `runtime.go:1152` correctly gates emission through `StartableToolSet.ShouldReportFailure()` (which returns `true` exactly once per failure streak). The "list failed" path skips this guard, so every iteration of `collectTools` re-emits a fresh warning for the same underlying problem.

**2. No dedup in the notification manager** — `pkg/tui/components/notification/notification.go:131`:

```go
case ShowMsg:
    id := nextID.Add(1)
    ...
    item := notificationItem{ID: id, Text: msg.Text, Type: notifType}
    n.items = append([]notificationItem{item}, n.items...)
```

Every `ShowMsg` prepends a new item to `n.items` unconditionally. Persistent items (`TypeWarning`/`TypeError`) never auto-expire (`persistent()` returns `true`, so no `tea.Tick` is scheduled), which means identical cards stack until the user clicks each `[x]`.

Together, the two gaps add one more copy of the same warning per failing toolset on every turn, for the rest of the session.

## Proposed fix

Fixing either side removes the symptom; fixing both resolves it cleanly:

- **Agent side:** route `collectTools` errors through the same once-per-streak guard the start path uses. `StartableToolSet` already tracks `pendingWarning` for `Start()` failures; extend it (or add a sibling `pendingListWarning`) so a repeated `Tools()` failure for an already-started-but-now-broken toolset only surfaces once until it recovers.
- **Notification side:** in `Manager.Update`'s `ShowMsg` case, if a persistent notification with identical `Text` is already present, drop the new one (or bump a counter "× N" appended to the existing card). Cheap, defensive, and useful for any future caller that emits duplicate warnings.

I'd recommend doing both: the agent fix is the primary one, and the notification fix is a safety net.
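
A rough sketch of the notification-side dedup, written standalone so it runs without the TUI. The `manager` type, the `show` method, the `Count` field, and the `label` helper are assumptions made for illustration; only the `ID`/`Text`/`Type` fields and the prepend come from the excerpt above.

```go
package main

import "fmt"

type notificationType int

const (
	typeInfo notificationType = iota
	typeWarning
	typeError
)

// notificationItem mirrors the fields visible in the excerpt above.
// Count is a hypothetical addition that backs the "× N" suffix.
type notificationItem struct {
	ID    uint64
	Text  string
	Type  notificationType
	Count int
}

// persistent reports whether the item stays until dismissed.
func (i notificationItem) persistent() bool {
	return i.Type == typeWarning || i.Type == typeError
}

// label renders the card text, adding a counter once duplicates were merged.
func (i notificationItem) label() string {
	if i.Count > 1 {
		return fmt.Sprintf("%s (× %d)", i.Text, i.Count)
	}
	return i.Text
}

// manager is a stand-in for the real notification manager.
type manager struct {
	nextID uint64
	items  []notificationItem
}

// show adds a notification, newest first. A persistent notification whose
// type and text match an existing card bumps that card's counter instead.
func (m *manager) show(text string, t notificationType) {
	candidate := notificationItem{Text: text, Type: t, Count: 1}
	if candidate.persistent() {
		for i := range m.items {
			if m.items[i].Type == t && m.items[i].Text == text {
				m.items[i].Count++
				return
			}
		}
	}
	m.nextID++
	candidate.ID = m.nextID
	m.items = append([]notificationItem{candidate}, m.items...)
}

func main() {
	var m manager
	const warn = "mcp(remote) list failed: toolset not started"
	for turn := 0; turn < 5; turn++ {
		m.show(warn, typeWarning)
	}
	for _, it := range m.items {
		fmt.Println(it.label())
	}
}
```

Five identical warnings collapse into a single card with a `(× 5)` suffix. Transient notifications are deliberately left alone here, since they expire on their own; dropping the duplicate instead of counting it would just mean returning without the increment.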

## Repro environment

- macOS, Apple silicon, 64 GB
- `docker-agent` HEAD as of 2026-05-22
- Multi-agent config with several remote MCPs (Notion SSE, an internal streamable MCP)
- Sessions of several hours; warnings observed across multiple agents (`mark_iv`, `root`)

## Related

- #2861 — heap leak per assistant message (memory growth → jetsam kill). Independent root cause; this issue is purely the TUI surface area for repeated MCP failures.

