tower + vscode: fix terminal freeze (oversized-replay storm) and blank-on-open#1050
Open
amrmelsayed wants to merge 31 commits into
Open
tower + vscode: fix terminal freeze (oversized-replay storm) and blank-on-open#1050amrmelsayed wants to merge 31 commits into
amrmelsayed wants to merge 31 commits into
Conversation
…ot-reconnect, listener hygiene)
… + reconnect listener hygiene
…eplay across restart)
…on live overload, resume-delta reconnect
…n a Tower monitor interval
…ment the three distinct size limits
…TUI replay The byte caps (new in this PR) front-trimmed the alt-screen stream, dropping the screen-setup prefix and blanking terminals until a manual resize. They were never needed for the freeze fix (scan-only pushData fixes CPU; the pause/resume bracket + drop-not-reconnect fix the storm) — they only bounded memory/replay size, which #1047 rated minor. Reverts partial/replay to main's faithful behavior; keeps scan-only, the bracket, drop-not-reconnect, resume-delta, listener hygiene, and the (now growth-observability) partial monitor.
…ry [#1047-diag] logging
…GWINCH-on-connect); revert Option A
…achShellper idempotency) tests
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PIR Review: Terminal freeze + blank-on-open (oversized-replay storm and missing connect-time redraw)
Fixes #1047
Summary
Two client/server terminal bugs were fixed. (1) Freeze: a full-screen TUI (Claude's UI) redraws in place and emits almost no newlines, so Tower's reconnection buffers — bounded by line count — grew unbounded; the resulting multi-MB replay overflowed the VSCode client's 1 MB receive budget, and the client's "disconnect-and-reconnect for replay" response re-fetched the same oversized snapshot in a tight loop (~14k cycles/hour), pegging one CPU core and starving every terminal. (2) Blank-on-open: a freshly-attached terminal could render a blank pane until a manual window resize, because the connect-time resize could be a same-size no-op and a full-screen TUI only repaints on a size change. Tower now scans only new output (killing the O(n²) re-scan) and brackets replay with
pause/resumeso it is paced and excluded from backpressure; the client drops live output under overload instead of reconnecting, reconnects with a?resume=<seq>delta, and forces a guaranteed redraw shortly after connect (mirroring the web dashboard).Files Changed
packages/codev/src/terminal/ring-buffer.ts(+39 / -…) —pushDatascans only the incoming chunk (O(|data|), not O(|partial|));partialkept whole for faithful replay;partialBytesgetter for observability.packages/codev/src/terminal/pty-session.ts(+14) —attachShellperidempotency (drop prior client listeners on re-attach);partialBytespassthrough.packages/codev/src/terminal/pty-manager.ts(+30 / -…) —createSessionRawtears down a colliding session before overwrite;inspectPartials()for the monitor.packages/codev/src/agent-farm/servers/tower-websocket.ts(+8 / -…) — bracket replay withpause/resumecontrol frames.packages/codev/src/agent-farm/servers/tower-server.ts(+32) — periodic ring-buffer-partial monitor (observability).packages/vscode/src/terminal-adapter.ts(+124 / -…) — replay excluded from backpressure; live overload drops instead of reconnecting;?resume=<lastSeq>delta reconnect; post-connect repaint nudge.packages/codev/src/terminal/__tests__/ring-buffer.test.ts(+25),packages/codev/src/agent-farm/__tests__/tower-websocket.test.ts(+14 / -…),packages/vscode/src/__tests__/terminal-adapter.test.ts(+137 / -…) — unit coverage.packages/vscode/CHANGELOG.md(+7) — user-facing entries for both fixes.codev/resources/arch.md(+4),codev/resources/lessons-learned.md(+3) — see sections below.(
git diff --stat: 12 files, +410 / -27, excluding plan/thread/status artifacts.)Commits
8868ac86[PIR tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047] Fix A+E: byte-cap RingBuffer.partial (scan-only pushData) + reconnect listener hygiene7d42dac9[PIR tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047] Fix B: byte-cap ShellperReplayBuffer (bounds no-newline replay across restart)f7e6b3a2[PIR tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047] Fix C+D: pause/resume replay bracket, drop-not-reconnect on live overload, resume-delta reconnect9f282745[PIR tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047] Instrumentation: log terminal ring-buffer partial sizes on a Tower monitor interval757a75a5[PIR tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047] Update tower-websocket test for pause/resume replay bracket2f5e6351[PIR tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047] docs: user-facing CHANGELOG entry for terminal-freeze fixb05b73f2[PIR tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047] Hoist replay-cap resolution out of constructor args; document the three distinct size limits02ff5546[PIR tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047] Remove byte caps: keep partial/replay whole for faithful TUI replay7100a5a3[PIR tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047] Re-assert terminal size after replay (Option A) + temporary diag logging81c4a83b[PIR tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047] Force a post-connect redraw nudge (mirror web client's SIGWINCH-on-connect); revert Option A6d994f72[PIR tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047] docs: changelog entry for terminal blank-on-open fixcc7fbdc7[PIR tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047] Strip temporary [tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047-diag] diagnostic logging(Plus the review/test-coverage commit carrying this file.)
Test Results
pnpm build: ✓ passpnpm --filter @cluesmith/codev test: ✓ 3308 passed / 0 failed / 48 skippedpackages/vscodeadapter suite: ✓ 20 passed (including 5 storm-prevention + 2 repaint-nudge tests new for tower + vscode: terminal freeze from oversized-replay reconnect storm (unbounded no-newline buffers + client backpressure loop) #1047); full VSCode suite previously green at 414.dev-approvalgate, macOS VSCode against the worktree Tower): confirmed via the temporary[#1047-diag]client log that there is no connect→backpressure→reconnect storm,seqheartbeats stay clean, and — after the nudge landed — terminals paint on open without a manual resize.Architecture Updates
COLD (
codev/resources/arch.md) — added a "Terminal reconnect/replay contract (#1047)" subsection under Shellper Lifecycle documenting the now-load-bearing behaviors: replay bracketed bypause/resume(excluded from client backpressure),?resume=<seq>delta reconnect, the deliberately-unboundedpartial(faithful TUI replay) with CPU bounded by scan-onlypushData, the client's connect-time forced redraw, and drop-not-reconnect under live overload.No HOT (
arch-critical.md) change: these are subsystem-level protocol details, not a top-tier always-injected system-shape fact, and the hot file is at its cap.Lessons Learned Updates
COLD (
codev/resources/lessons-learned.md, Debugging section) — three[From 1047]entries: (1) cross-client differential diagnosis — a symptom in one client but not another sharing the backend localizes the bug to the client layer (web rendered fine → the app was painting → the VSCode adapter was at fault), and a full-screen TUI only repaints on a size change; (2) a buffer bounded by newline count is unbounded for a newline-free stream, with the on-disk PTY logs ("longest run without\n") as the decisive evidence, and the caution that front-trimming a buffer corrupts TUI replay; (3) a backpressure relief valve that re-fetches the overflowing payload is an infinite loop — drop ephemeral output, don't reconnect.No HOT (
lessons-critical.md) change: these are debugging recipes, not a top-tier cross-cutting rule, and the hot file is at its cap.Things to Look At During PR Review
RingBuffer.partialandShellperReplayBuffer; they were reverted (02ff5546) after dev-approval testing showed front-trimming corrupts a full-screen TUI's replay (its alt-screen state lives in the cumulative stream from the alt-screen-enter onward). The freeze is fixed by scan-onlypushData+ the replay bracket, not by bounding the buffer. Net effect:partial/replay are functionally back tomain's faithful-but-unbounded behavior; the newtower-server.tsmonitor logs partial size so unbounded growth (a known, issue-rated-minor trade-off) is observable.terminal-adapter.ts,scheduleRepaintNudge). It is arows-1→rowssize delta 500ms after connect, gated onrenderedSinceConnectso a reconnect that already painted via replay doesn't reflow. The brief 1-row intermediate frame is intentional (guarantees a realTIOCSWINSZdelta even when the PTY is already at the target size). Confirm the gating and that it's cleared on close/reconnect (resetStreamState).#737and the nudge are kept separate (deliberate):#737's on-open resize sets the correct size immediately (no flicker, common case); the nudge is the delayed backstop for the same-size no-op case. They're complementary, not redundant.REQUEST_CHANGESfrom the 3-way consult is not independently re-reviewed — please scrutinize the disposition at this gate.How to Test Locally
pir-1047→ Review Diff.afx dev pir-1047.?resume=(delta).Flaky Tests
None skipped. (Earlier full-suite runs showed environmental failures in
session-manager/adopt/hot-tierintegration tests whendist/and the copied skeleton were absent; a fullpnpm buildresolves them — they are not related to this change and pass after build.)