✅ CONFIRMED ROOT CAUSE (2026-06-15)
Captured data (PTY disk logs + a 42,143-line VSCode terminal log + code trace) confirms the mechanism end to end. The original hypotheses below (Tower-wide WS pump degradation, broadcast-list leak, socket backpressure) are superseded — the real cause is an oversized-replay reconnect storm caused by two independent defects that chain:
Tower/shellper side — the root cause
Two terminal buffers bound themselves by newline count, so a full-screen TUI that redraws in place (alternate screen buffer, cursor-addressing + \r, almost no \n) makes them grow without limit:
- RingBuffer.partial (packages/codev/src/terminal/ring-buffer.ts:36-51): combined = partial + data; combined.split('\n') every frame, keeping the trailing fragment. No \n → partial grows unbounded and every frame re-scans the whole thing (O(partial)/frame → O(n²)/session). No byte cap.
- ShellperReplayBuffer (packages/codev/src/terminal/shellper-replay-buffer.ts:45,58, fed at shellper-process.ts:143): evicts only while lineCount > maxLines, so with zero newlines it never evicts.
On every (re)connect Tower sends the buffer as one frame: ws.send(encodeData(getAll().join('\n'))) (tower-websocket.ts:62-65) → a multi-MB replay payload.
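The growth described above can be reproduced with a simplified model. This is a sketch, not the actual RingBuffer source: the class name LineBoundedBuffer, the 500-line limit, and the 100-character frames are all assumptions chosen for illustration.

```typescript
// Simplified model of a newline-bounded buffer. NOT the actual RingBuffer source.
class LineBoundedBuffer {
  private lines: string[] = [];
  private partial = "";

  constructor(private readonly maxLines: number) {}

  push(data: string): void {
    // Concatenate and re-split everything buffered so far (cost grows with partial).
    const parts = (this.partial + data).split("\n");
    // split() always returns at least one element; the last is the unterminated tail.
    this.partial = parts.pop() ?? "";
    for (const line of parts) this.lines.push(line);
    // Eviction looks at line count only, never at size.
    while (this.lines.length > this.maxLines) this.lines.shift();
  }

  retainedChars(): number {
    let total = this.partial.length;
    for (const line of this.lines) total += line.length;
    return total;
  }
}

// TUI-style frame: cursor-home escape, 96 payload chars, carriage return, no "\n".
const redrawFrame = "\x1b[H" + "x".repeat(96) + "\r"; // 100 chars
const logLine = "x".repeat(99) + "\n"; // 100 chars, newline-terminated

const tui = new LineBoundedBuffer(500);
const log = new LineBoundedBuffer(500);
for (let i = 0; i < 2000; i++) {
  tui.push(redrawFrame);
  log.push(logLine);
}

console.log(tui.retainedChars()); // 200000: nothing was ever evicted
console.log(log.retainedChars()); // 49500: capped at 500 lines of 99 chars
```

Both streams push the same volume of data; only the newline-terminated one is ever trimmed.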
VSCode side — the amplifier
The terminal adapter's backpressure guard trips on the replay frame itself and reconnects with no backoff:
- handleData (packages/vscode/src/terminal-adapter.ts:300-308): queuedBytes += payload.length; if (queuedBytes > MAX_QUEUE /*1 MB*/) reconnect(). This fires before rendering anything.
- reconnect() (terminal-adapter.ts:281-296) calls backoff.reset() and reconnects instantly → Tower re-sends the same oversized replay → infinite loop.
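The loop can be modelled in a few lines. This is a hypothetical model, not the adapter's code: simulateConnects and StormResult are names invented here, and each loop iteration stands in for one fresh connection whose first frame is the full replay.

```typescript
const MAX_QUEUE = 1024 * 1024; // 1 MB, the threshold quoted above

interface StormResult {
  reconnects: number;
  rendered: boolean;
}

// Hypothetical model: every attempt starts with an empty queue and
// immediately receives the whole replay as a single frame.
function simulateConnects(replayBytes: number, maxAttempts: number): StormResult {
  let reconnects = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    let queuedBytes = 0;
    queuedBytes += replayBytes;
    if (queuedBytes > MAX_QUEUE) {
      // Guard trips on the replay itself; with the backoff reset,
      // the next attempt starts immediately and sees the same replay.
      reconnects++;
      continue;
    }
    return { reconnects, rendered: true };
  }
  return { reconnects, rendered: false };
}

const oversized = simulateConnects(1.9 * 1024 * 1024, 1000);
const normal = simulateConnects(200 * 1024, 1000);
console.log(oversized); // { reconnects: 1000, rendered: false }
console.log(normal); // { reconnects: 0, rendered: true }
```

Because the replay size does not shrink between attempts, no number of retries lets the oversized case render; the attempt limit here exists only so the model terminates.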
Why it presents the way it does
- CPU ~93% (one core), all terminals freeze at once: Tower re-serializes a multi-MB replay thousands of times on its single event loop, starving every other terminal. Input stays responsive (tiny), output is frozen — matching the stray-input-char screenshot.
- Memory barely moves: the cost is CPU + GC churn, not retained memory.
- Restart is NOT reliable (corrects the original title): ShellperReplayBuffer is also unbounded, so after restart the shellper replays the multi-MB blob, Tower re-seeds an oversized partial, and the storm resumes, unless the offending session has gone quiet/ended.
Captured evidence
- Storming terminal f2dc55d1 disk log: 1.9 MB, 0 newlines → replay 1.9 MB > 1 MB MAX_QUEUE.
- Incident-window log f02bedcb: 15 MB, 0 newlines (1.5M ESC, 164K CR, 0 LF; a Claude alt-screen redraw stream). Normal sessions: ~20 bytes/line.
- VSCode log (~58 min): 14,026 "WebSocket connected" / 14,015 "Backpressure exceeded 1 MB" / 14,026 "Connecting to" (1:1:1); 14,017 of the reconnects target the single terminal f2dc55d1.
- Threshold is low: only ~1 MB of no-newline output triggers the storm.
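Counts like the ESC/CR/LF figures above can be gathered with a small byte classifier. The following is a sketch under assumptions: countControlBytes and bytesPerLine are hypothetical helpers, not the tooling used to produce the numbers in this report.

```typescript
interface ControlCounts {
  bytes: number;
  esc: number;
  cr: number;
  lf: number;
}

// Count escape (0x1b), carriage-return (0x0d) and line-feed (0x0a) bytes.
function countControlBytes(data: Uint8Array): ControlCounts {
  const counts: ControlCounts = { bytes: data.length, esc: 0, cr: 0, lf: 0 };
  for (let i = 0; i < data.length; i++) {
    const b = data[i];
    if (b === 0x1b) counts.esc++;
    else if (b === 0x0d) counts.cr++;
    else if (b === 0x0a) counts.lf++;
  }
  return counts;
}

// Average bytes per newline-terminated line; Infinity when there are no newlines.
function bytesPerLine(c: ControlCounts): number {
  return c.lf === 0 ? Infinity : c.bytes / c.lf;
}

// One redraw frame: ESC [ H, a payload byte, then CR. No LF.
const sample = new Uint8Array([0x1b, 0x5b, 0x48, 0x78, 0x0d]);
const sampleCounts = countControlBytes(sample);
console.log(sampleCounts); // { bytes: 5, esc: 1, cr: 1, lf: 0 }
console.log(bytesPerLine(sampleCounts)); // Infinity
```

In Node.js a file read into a Buffer can be passed directly, since Buffer is a Uint8Array subclass. A log with many ESC and CR bytes but zero LF has the signature of the redraw stream described above.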
Fix (see codev/plans/1047-... on builder/pir-1047)
Three coordinated fixes: (A) RingBuffer.pushData scans only new data + byte-caps partial (< MAX_QUEUE); (B) ShellperReplayBuffer gains a byte cap; (C) the VSCode backpressure path must not infinite-loop on an oversized replay (backoff + re-trip guard, ideally excluding the replay burst from the live-backpressure budget). A and B remove the oversized replay at the source; C makes the client structurally immune. Scope now spans the terminal core and the VSCode extension (re-labeled area/cross-cutting).
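As a rough illustration of fixes (A) and (C), and not the code in the plan, the sketch below shows a partial buffer that scans only new data and is capped by size, plus a reconnect delay that grows on consecutive trips. CappedPartial, ReconnectGuard, the cap of half MAX_QUEUE, and the 250 ms / 30 s delays are all assumptions. Size is measured in string length rather than encoded bytes for brevity.

```typescript
const MAX_QUEUE = 1024 * 1024; // 1 MB client budget, per the report

// (A): keep at most capChars of unterminated output, scanning only new data.
class CappedPartial {
  private partial = "";

  constructor(private readonly capChars: number) {}

  // Returns the complete lines found in data; retains the capped remainder.
  push(data: string): string[] {
    const lines: string[] = [];
    let start = 0;
    let nl = data.indexOf("\n", start);
    while (nl !== -1) {
      lines.push(this.partial + data.slice(start, nl));
      this.partial = "";
      start = nl + 1;
      nl = data.indexOf("\n", start);
    }
    this.partial += data.slice(start);
    if (this.partial.length > this.capChars) {
      // Keep the newest output. Caveat: this can cut an escape sequence in half.
      this.partial = this.partial.slice(this.partial.length - this.capChars);
    }
    return lines;
  }

  size(): number {
    return this.partial.length;
  }
}

// (C): reconnect delay doubles on each consecutive trip, up to maxMs.
class ReconnectGuard {
  private trips = 0;

  constructor(private readonly baseMs: number, private readonly maxMs: number) {}

  nextDelayMs(): number {
    const delay = Math.min(this.maxMs, this.baseMs * Math.pow(2, this.trips));
    this.trips++;
    return delay;
  }

  // Call only after a connection has rendered data without tripping.
  onHealthy(): void {
    this.trips = 0;
  }
}

const partial = new CappedPartial(MAX_QUEUE / 2);
const guard = new ReconnectGuard(250, 30000);
```

The important design point for the guard is when the trip counter resets: resetting it on every reconnect reproduces the storm, so the sketch resets it only after a connection has proven healthy.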
Original report (investigation history — hypotheses superseded by the confirmed cause above)
Symptom
Over time, architect AND builder PTY terminals stop rendering output and accepting input that propagates back to the visible UI. The symptom affects ALL terminals at once, not individual sessions. The only known recovery is afx tower stop && afx tower start (the default non-force variant).
Recent screenshot of an affected architect terminal: only Ran 1 shell command remained rendered, plus a stray input character (the user's e) at the bottom-left corner where the prompt frame should be. The TUI box-drawing of the prompt panel was gone, but the typed character + cursor block still appeared, suggesting the input side still propagates but the render-back side is broken.
What the diagnosis rules out
Per #1030 / PR #1031 (just landed), the default afx tower stop does NOT kill shellpers. Shellpers and the AI CLI processes they host survive the restart. Yet restart restores responsiveness, so:
- The AI CLI is not hung (it would still be hung after a non-force restart).
- The shellper process is not crashed (it survives the restart too).
- The PTY itself is not dead.
What actually changes during the default-stop / start cycle:
- Tower's WebSocket connections drop.
- The VSCode extension's terminal client detects disconnect and reconnects to the new Tower process.
- New Tower runs reconcileTerminalSessions, reattaches to the still-alive shellpers, and resumes streaming.
So the bug lives in the Tower-to-client streaming layer or the per-terminal client-list state that Tower maintains, not in the AI CLI / shellper / PTY.
Tower-side log behavior
Tower itself stays "alive" in the process sense during the hang: SSE heartbeats keep firing every 30s, cron tasks keep running, the team scanner keeps iterating. So this is not a process-wide event-loop deadlock. It's specifically the WS-to-terminal pump that's wedged.
Top hypotheses (ordered by fit with the all-terminals-at-once pattern)
- Tower-wide WS send-pump degradation. Memory growth, GC stalls, or an unbounded handler/listener leak that gradually starves the WS output loop. Affects all terminals because they share the same Tower process. Possibly worsened by repeated errors that throw past their handler scope without cleaning up (the cron evaluator ReferenceError filed separately is a candidate contributor).
- Leaked closed-connection entry in the per-terminal broadcast list. Tower iterates its client list to fan out PTY output; a dead WS still in the list blocks the broadcast for everyone. Restart drops the list.
- Shellper-to-Tower unix-socket backpressure. Shellper streams output to Tower over its unix socket; if Tower stops draining fast enough, shellper's pipe fills and stops producing. Restart drops the connection, shellper opens a fresh one on reattach.
Diagnostic plan for the next occurrence
- While hung, observe afx tower log -f: does the SSE heartbeat keep firing every 30s? If yes, event loop is alive; the WS pump is the culprit (H1 or H2). If no, deeper event-loop starvation.
- ps -p <tower-pid> -o pid,vsz,rss,pcpu while hung. RSS growth into multi-GB territory points at H1.
- lsof -p <tower-pid> | wc -l while hung. A climbing fd count points at leaked connections (H2-adjacent).
- From a side terminal: afx send architect "echo ping-from-side". Watch the Tower log for the send event AND watch the affected architect tab for the echoed output. If Tower logs the send but the tab never shows ping-from-side, it's pinned to the Tower-to-VSCode WS direction (not the input side).
- Open the same architect session in the Tower dashboard at http://localhost:4100. If the dashboard renders fresh output while the VSCode tab stays frozen, the VSCode-side xterm.js renderer is in the loop too (downgrades H1/H2 to "VSCode-extension-specific WS handling" instead of "Tower-side").
Why this is PIR, not BUGFIX
Root cause is not yet known — three plausible mechanisms with overlapping symptoms. Investigation phase (capture diagnostics, narrow to a single mechanism, then design the fix) is the right shape. Pre-PR running-on-real-workload verification is load-bearing because the bug only manifests after time-in-use.
Out of scope
- ReferenceError: exitCode is not defined cron evaluator bug visible in the same Tower log window. Filed separately.
- tower stop shellper-killing behavior. Shellpers must continue surviving restart per Architect terminal should survive Tower restarts (shellper persistence) #274 / Multi-architect conversation resume: disambiguate via per-architect session UUID #832 / Terminals survive a Tower restart: preserve terminal id + stop afx tower stop killing port clients (#991) #999 / terminal: stale tab on a pre-restart terminal id can't self-recover without a manual state re-fetch #991.