diff --git a/codev/plans/1047-tower-terminals-architects-bui.md b/codev/plans/1047-tower-terminals-architects-bui.md new file mode 100644 index 000000000..4cd57a569 --- /dev/null +++ b/codev/plans/1047-tower-terminals-architects-bui.md @@ -0,0 +1,133 @@ +# PIR Plan: Tower terminal freeze — oversized replay storm from unbounded no-newline buffers + +> Issue #1047. Tower terminals (architects + builders) become non-responsive over time; restart is the only known recovery, and even that is **not reliably** effective. + +## Understanding + +### Root cause (confirmed end-to-end with captured data) + +A full-screen TUI (Claude Code's UI) runs in the alternate screen buffer and redraws **in place** with cursor-addressing and carriage returns, emitting almost **no `\n`**. Several buffers in the terminal path bound themselves **by newline count**, so they never bound for such a stream. The unbounded buffer then produces a **replay payload larger than the client's 1 MB backpressure limit**, which drives the VSCode terminal client into a **tight, backoff-free disconnect/reconnect/replay loop**. That loop — re-serializing a multi-MB replay thousands of times on Tower's single event loop — is the dominant CPU sink and starves every other terminal. + +The closed causal loop: + +1. **Unbounded accumulation.** `RingBuffer.partial` (`ring-buffer.ts:36-51`) does `combined = this.partial + data; combined.split('\n')` every frame and keeps the trailing fragment; with no `\n`, `partial` grows without bound and each frame re-scans the whole thing (O(partial)/frame → O(n²)/session). The shellper's `ShellperReplayBuffer` (`shellper-replay-buffer.ts:45,58`, fed at `shellper-process.ts:143`) evicts only `while lineCount > maxLines`, so with zero newlines it never evicts. Neither has a byte cap (the only byte cap in the layer is the *stderr* ring's `maxLineLength=10000` at `session-manager.ts:67`). +2. **Oversized replay.** On every (re)connect, Tower sends the buffer as one frame: `ws.send(encodeData(replayLines.join('\n')))` (`tower-websocket.ts:62-65`), where `replayLines = ringBuffer.getAll()` = `[partial]`. +3. **Client backpressure trips on the replay itself.** The VSCode client's `handleData` (`packages/vscode/src/terminal-adapter.ts:300-308`) does `this.queuedBytes += payload.length; if (this.queuedBytes > MAX_QUEUE /*1 MB*/) { reconnect() }`. The replay frame alone exceeds 1 MB, so it disconnects **before rendering anything**. +4. **Reconnect with no backoff.** `reconnect()` (`terminal-adapter.ts:281-296`) calls `this.backoff.reset()` and reconnects immediately, opening the bare `/ws/terminal/` (no `resume`) → Tower re-sends the same full oversized replay → back to step 3. Infinite loop. + +### Captured evidence + +PTY disk logs (`~/.agent-farm/logs/*.log`, exact terminal output): + +| Session | Size | Newlines | Longest run without `\n` | +|---|---|---|---| +| `f2dc55d1…` (**the storming terminal**) | 1.9 MB | **0** | **1.89 MB (entire file)** | +| `f02bedcb…` (incident window) | 15 MB | **0** | 14.57 MB | +| `d8406afb…` / `4d5523cb…` (normal) | 20–24 MB | ~1.1 M | 5–7 KB | + +`f2dc55d1`'s replay = 1.9 MB **> 1 MB** `MAX_QUEUE`. Byte census of the 15 MB log: 1,500,892 ESC, 164,652 CR, **0 LF** — a Claude alt-screen redraw stream (`\e[?1049h \e[2J \e[H … \e[3G \r`). + +VSCode extension log (`tmp/vscode-log.txt`, 42,143 lines, ~58 min): **14,026** "WebSocket connected" / **14,015** "Backpressure exceeded 1 MB" / **14,026** "Connecting to" (1:1:1), **14,017** of them targeting the single terminal `f2dc55d1`. Only **~1 MB** of no-newline output is needed to start the storm. + +### Why this matches every observation + +- **CPU ~93% (one core), all terminals freeze at once:** Tower re-builds + re-serializes a multi-MB replay ~14,000 times on its single event loop, starving every other terminal. Input stays responsive (tiny); output is frozen — matching the stray-`e` screenshot. +- **Memory barely moves:** the cost is CPU + GC churn, not retained memory. +- **Restart is NOT reliable:** `ShellperReplayBuffer` is also unbounded, so after restart the shellper replays the multi-MB blob, Tower re-seeds an oversized `partial`, and the storm resumes unless the offending session has gone quiet/ended. + +The earlier EventEmitter listener-leak hypothesis is **demoted to a defensive cleanup** (folded in below) — it explains none of the captured evidence, whereas this chain explains all of it. + +### Local vs remote hosting raises the stakes + +Tower can run on loopback (same machine as VSCode) or remotely (cloud / via the tunnel). Three things flip when remote, and they shape the fix: + +- **Reconnects are frequent remotely** (WAN blips, tunnel timeouts, sleep/wake) — so a *full replay on every reconnect* is paid routinely, not just during the storm. Reconnect-path leaks also accumulate proportionally faster. +- **Bytes cost money and bandwidth remotely** — 14k × 1.9 MB ≈ 26 GB/hr for one terminal; even a *bounded* buffer re-shipped on every reconnect is wasteful. +- **"Reconnect to relieve backpressure" is harmful remotely** — it re-downloads the whole buffer over the congested link. Notably Tower's *live-output* side already does the resilient thing (`if (ws.bufferedAmount < WS_HIGH_WATER_MARK) ws.send(...)` — it **drops** ephemeral frames, `tower-websocket.ts:47`); the client's remedy is the opposite and must be brought into line. + +## Proposed Change + +One coordinated change built on four resilience principles (correct regardless of hosting; load-bearing remotely): bound by bytes, make reconnect a cheap delta, separate replay from live overload, and keep a hard last-resort safety net. + +### Fix A (Tower) — `RingBuffer`: scan-only, byte-cap, and byte-addressable resume + +- **Scan only the incoming `data`** for newlines instead of re-splitting `partial + data` → per-frame work O(|data|), killing the O(n²) re-scan. Behavior-preserving for replay. +- **Byte-cap `partial`** (`MAX_PARTIAL_BYTES`), front-trimming an over-cap unbroken run. Bounds CPU/memory/bandwidth. Front-trim (not synthetic `\n`) avoids corrupting a TUI replay; reconnect drives a full repaint that self-heals. +- **Byte-addressable sequence:** advance a monotonic byte counter on every `pushData` (not only on completed lines), and make `getSince(seq)` return the bytes after that offset (reconstructed across the retained lines + partial). Today `seq` only moves on a completed line, so for a no-newline stream `currentSeq` stays 0 and resume returns the whole partial — defeating resume exactly when it's needed most. This is what makes Fix C's delta-reconnect actually work for TUI streams. If a client's position is older than the retained window, fall back to a bounded full replay (≤ cap). + +### Fix B (shellper) — `ShellperReplayBuffer`: byte cap + +Add a `maxBytes` cap alongside `maxLines` (it already tracks `totalBytes`): evict oldest chunks while `totalBytes > maxBytes`, with a single-chunk front-trim edge case mirroring the line logic. Bounds shellper memory and the REPLAY frame so restart can't re-seed an oversized buffer. + +### Fix C (VSCode client) — reconnect/backpressure redesign (`terminal-adapter.ts`) + +- **Delta reconnect:** reconnect with `?resume=` (the client already tracks `lastSeq` from `seq` control frames — it's currently dead code) so a reconnect ships only new bytes, not the whole buffer. The biggest remote win. +- **Replay excluded from backpressure:** deliver the replay burst through the already-paced `writeChunked` path (yields via `setImmediate`) and do **not** count it toward `queuedBytes`. Use the half-present `pause`/`resume` control bracket (`terminal-adapter.ts:345-355`) to mark "this is replay" so the client renders any replay size without tripping. This also removes the hard cross-package `MAX_PARTIAL_BYTES < MAX_QUEUE` coupling — once replay never counts as backpressure, the caps are about Tower cost, not client safety. +- **Live overload drops, not reconnects:** when *live* (post-replay) output exceeds the budget, drop/coalesce frames (terminal output is ephemeral; the next repaint heals it), mirroring Tower's `bufferedAmount` drop. Reconnect becomes a last resort, never the routine remedy. +- **Hard safety net kept:** do not `backoff.reset()` on a backpressure-driven reconnect; retain the backoff curve + give-up so no Tower bug, tunnel flap, or future regression can busy-loop. + +### Fix D (Tower) — replay bracket + resume wiring (`tower-websocket.ts`) + +Bracket the replay send with `pause` → replay data → `resume` control frames so the client can exclude it from backpressure; honor `?resume=` via `attachResume` (already present) using Fix A's byte-aware `getSince`; emit the byte-aware `seq`. + +### Fix E (Tower) — reconnect-path listener hygiene (folded in) + +`attachShellper` (`pty-session.ts:119-197`) becomes idempotent (remove prior `data`/`exit`/`close` listeners before re-subscribing); the on-the-fly reconnect / `createSessionRaw` path (`pty-manager.ts:161`) tears down any pre-existing PtySession under a reused id before replacing it. Frequent remote reconnects make this matter more than it does locally. + +### Cap configuration + invariant + +`MAX_PARTIAL_BYTES` (~256 KB) and `ShellperReplayBuffer maxBytes` (~ a few MB) configurable via env for cloud tuning. With Fix C the `< MAX_QUEUE` relationship is no longer correctness-critical, but keep a startup assertion/log if both are set so a misconfig is visible rather than silent. + +### Instrumentation + +On the existing 30 s SSE heartbeat, log per-session `ringBuffer` partial length + shellper replay byte size, and `WARN` (with terminal id) when a partial exceeds a threshold. + +### Explicitly out of scope + +- Per-frame `fs.writeSync` → async/batched disk log (separate optimization; note for follow-up if profiling after A still shows it). +- Default `tower stop` shellper-survival (#274/#832/#999/#991); cron `ReferenceError` (#1048); spawn-failure sibling (#1038). + +## Files to Change + +- `packages/codev/src/terminal/ring-buffer.ts` — scan-only `pushData`; byte-cap `partial`; byte-addressable `seq` + `getSince`. `MAX_PARTIAL_BYTES`. +- `packages/codev/src/terminal/shellper-replay-buffer.ts` — `maxBytes` cap + byte eviction; thread default/option from `shellper-process.ts:97`. +- `packages/vscode/src/terminal-adapter.ts` — delta reconnect (`?resume=`), replay-excluded-from-backpressure (`pause`/`resume` bracket), live-overload drop, keep backoff/give-up (`handleData` ~300-321, `reconnect` ~281-296, `connect` ~130-160). +- `packages/codev/src/agent-farm/servers/tower-websocket.ts` — bracket replay with `pause`/`resume`; resume via byte-aware `getSince`; emit byte-aware `seq` (`53-75`). +- `packages/codev/src/terminal/pty-session.ts` — `attachShellper` idempotency (Fix E). +- `packages/codev/src/terminal/pty-manager.ts:161` — `createSessionRaw` teardown of reused-id session (Fix E). +- `packages/codev/src/agent-farm/servers/tower-server.ts:236-262` — instrumentation + cap-config validation. +- Possibly `@cluesmith/codev-types` / `ws-protocol` — only if the `pause`/`resume`/`seq` control shapes need additions (they appear to exist already). +- Tests under `packages/codev/src/terminal/__tests__/`, `packages/codev/src/agent-farm/__tests__/`, and `packages/vscode/` (see Test Plan). +- `packages/vscode/CHANGELOG.md` + `docs/releases/UNRELEASED.md` — user-facing entry. + +This is a large, multi-package change. It is sequenced so commits are independently reviewable (A → B → D → C → E → instrumentation), and the `dev-approval` reviewer can exercise it incrementally. The minimal safety net (Fix C backoff/give-up) means even a partial landing cannot regress into a storm. + +## Risks & Alternatives Considered + +- **Risk: byte-addressable resume is subtle** (reconstructing the delta across the line/partial boundary; off-by-one re-sends or gaps). Mitigation: unit tests asserting `getSince(seq)` returns exactly the bytes after `seq` for newline and no-newline streams; the bounded-full-replay fallback covers the out-of-window case; the backpressure-exclusion (Fix C) means even a wrong-but-large resume can't storm. +- **Risk: front-trimming `partial` drops early escape state** (e.g. alt-screen enter). Mitigation: reconnect drives a full TUI repaint; replay-correctness test asserts a normal newline stream is byte-identical before/after. +- **Risk: dropping live frames loses output.** Mitigation: only under genuine sustained overload (same regime Tower already drops in); terminal output is ephemeral and repainted; this is strictly better than the freeze it replaces. +- **Risk: PR size / blast radius across core + shellper + vscode.** Mitigation: sequenced commits; each fix is independently correct; behind no feature flag but each is small and locally testable; `dev-approval` gates the running result. +- **Alternative — land A/B/C-minimal now, resume/E as fast-follow.** Considered and set aside per decision to do it all together; the remote-resilience value of resume + clean reconnect path is the point. +- **Alternative — just raise `MAX_QUEUE`.** Rejected: moves the threshold, fixes neither the Tower O(n²)/memory nor the remote bandwidth cost. + +## Test Plan + +Run from the worktree (`pnpm --filter @cluesmith/codev …` for core; build the VSCode package). + +- **Unit — RingBuffer:** (a) no-newline stream keeps `partial.length ≤ cap` and per-frame cost flat; (b) normal newline stream replays byte-identically (no regression); (c) `getSince(seq)` returns exactly the post-`seq` bytes for both newline and no-newline streams, with the bounded fallback when out of window. +- **Unit — ShellperReplayBuffer:** zero-newline stream over `maxBytes` keeps `size ≤ maxBytes`; `getReplayData()` returns the bounded tail. +- **Unit/component — client (`terminal-adapter`):** (a) a replay frame > `MAX_QUEUE` renders (paced) and does **not** enter a reconnect loop; (b) reconnect issues `?resume=`; (c) sustained live overload drops/coalesces rather than reconnecting; (d) repeated genuine backpressure backs off and eventually gives up (no busy-loop). +- **Unit — Fix E:** many reconnect/re-attach cycles keep shellper-client listener counts and live PtySession count bounded; a single data frame is processed exactly once. +- **Build + suites:** core build + test green; VSCode extension builds. +- **Manual / live (at `dev-approval`), both hosting modes:** + - *Local:* open an architect terminal running Claude's full-screen UI; drive it past ~1 MB of no-newline output. Confirm via `afx tower log -f` that partial size plateaus at the cap and CPU stays low, and via the VSCode terminal log that there is **no** connect→backpressure→reconnect storm (contrast the 14k-cycle capture). Reconnect the tab and confirm correct repaint and that the reconnect used `resume` (delta, small payload). + - *Remote-ish:* repeat against a Tower reached over the tunnel (or with induced latency/drops); confirm reconnects ship deltas not full buffers, no storm, and the terminal stays usable across forced disconnects. +- **Soak (post-merge / verify, non-gating):** several hours on a real workload; CPU flat, no storms in the VSCode log. + +## Decisions (confirmed with architect — "do all together") + +1. **Caps:** env-configurable; `MAX_PARTIAL_BYTES` ~256 KB, `ShellperReplayBuffer maxBytes` ~ a few MB; invariant no longer correctness-critical thanks to Fix C, but surfaced if misconfigured. +2. **Fix C:** the full design — delta `resume` reconnect + replay-excluded-from-backpressure + live-overload-drops + retained backoff/give-up. +3. **Listener hygiene (Fix E):** folded into this PR. +4. **Resume/delta reconnect (incl. byte-addressable seq):** included — the remote-resilience centerpiece. diff --git a/codev/projects/1047-tower-terminals-architects-bui/1047-review-iter1-rebuttals.md b/codev/projects/1047-tower-terminals-architects-bui/1047-review-iter1-rebuttals.md new file mode 100644 index 000000000..f4d1e8402 --- /dev/null +++ b/codev/projects/1047-tower-terminals-architects-bui/1047-review-iter1-rebuttals.md @@ -0,0 +1,39 @@ +# PIR #1047 — CMAP iteration-1 rebuttals / dispositions + +3-way consult verdicts: **Codex `REQUEST_CHANGES`**, **Claude `APPROVE`**, **Gemini failed** (see below). PIR is single-pass (`max_iterations: 1`) — these dispositions are not independently re-reviewed; the human at the `pr` gate is the final check. + +## Gemini — DISREGARD (not a real review) + +The Gemini consult misfired: its output is *"Hi! The workspace directory is currently empty, and your prompt was `--sandbox`. Could you please clarify…"* — it ran against an empty sandbox and never saw the diff. This is a known agy/Gemini-consult environment issue, not a verdict. Porch grep-defaulted it to REQUEST_CHANGES because the text lacks "APPROVE". No actionable content; disregarded. + +## Codex — REQUEST_CHANGES (HIGH) + +### C1. "Byte-addressable seq not implemented → no true delta resume for no-newline streams" — REBUTTED (deliberate descope), with a doc fix + +Accurate observation, but not a defect. `seq` stays line-based, so for a pure no-newline TUI stream `currentSeq` stays at the last real line and `connectUrl()` sends a full (faithful) replay rather than a `?resume=` delta. + +This was **deliberately descoped during dev-approval**. The plan's byte-addressable seq existed to make a *bounded* (byte-capped) buffer resumable. But the byte caps were reverted (see C2) because front-trimming corrupts a full-screen TUI's replay — its alt-screen state lives in the cumulative stream from the alt-screen-enter onward, so a trimmed buffer renders blank/garbled (the exact regression the human caught at the gate). With the buffer kept whole, a no-newline reconnect does a full faithful replay, which is correct — it's `main`'s behavior, just without the bandwidth optimization for the pathological case. Correctness on (re)connect is guaranteed by the **post-connect repaint nudge** (forces a redraw SIGWINCH), which is implemented and human-verified. + +Re-introducing byte-addressable seq + byte caps to chase the delta optimization would reintroduce the replay-corruption regression. Not warranted. + +**Fix applied (the legitimate part):** the implementation was correct but the *docs overstated* it. Corrected: +- Review Summary now scopes `?resume=` to "newline-bearing streams" and adds an explicit Scope note on the descope. +- `RingBuffer.getSince` now carries a comment documenting that `seq` advances only on completed lines, that a caught-up no-newline client gets `[]`, and that the repaint nudge covers it. + +### C2. "Shellper byte cap (Fix B) absent → unbounded restart replay for zero-newline sessions" — REBUTTED (deliberate descope) + +Accurate: `ShellperReplayBuffer` is line-bounded only, so a no-newline session's shellper replay grows unbounded and Tower replays it fully on restart. This was **reverted on purpose**, for the same reason as the RingBuffer byte cap: a byte cap front-trims the buffer, which corrupts the TUI replay Tower reseeds from on reconnect. Unlike the Tower side, the shellper side has **no O(n²) hot path** (it just appends chunks and counts newlines), so its absence does **not** reintroduce the freeze — the freeze is fixed by scan-only `pushData` + the replay bracket + drop-not-reconnect, none of which depend on Fix B. The remaining cost is memory growth for long no-newline sessions, which issue #1047 explicitly rated minor/orthogonal ("+76 MB over 10h… probably orthogonal; the CPU is the load-bearing signal"), and which is now **observable** via the new `tower-server.ts` partial-size monitor. Accepted trade-off; not re-adding. + +### C3. "Test coverage and review overstate what shipped" — PARTIALLY VALID, FIXED + +The review did describe `?resume=` delta reconnect as the delivered contract without the no-newline caveat. **Fixed** (C1's doc corrections). There is intentionally no "no-newline seq/getSince regression test" because there is no byte-seq behavior to pin — the line-based behavior is now documented in the `getSince` comment instead. + +## Claude — APPROVE (HIGH), with minor notes — ADDRESSED + +- **"No Fix E unit tests"** — **Fixed.** Added `packages/codev/src/terminal/__tests__/pty-session-attach.test.ts` (3 tests): re-attach drops the previous client's `data`/`exit`/`close` listeners; a stale frame on the detached client no longer reaches the ring buffer; same-instance re-attach is a no-op. +- **"Document the getSince no-newline limitation in the code comment"** — **Fixed** (same comment as C1). +- **"`docs/releases/UNRELEASED.md` not updated"** — Not applicable: the file does not exist on this branch (only `UNRELEASED.template.md`); it's a between-releases gap, not an omission. CHANGELOG.md carries both user-facing entries. + +## Net change from this iteration + +Docs/comments corrected for accuracy; one regression-test gap (Fix E) closed. No behavioral code change — the descoped items (byte caps, byte-addressable seq) are intentionally not re-added, as re-adding them would regress the human-verified faithful-replay behavior. The seq/Fix-B descope is the decision the human should confirm at the `pr` gate. diff --git a/codev/projects/1047-tower-terminals-architects-bui/status.yaml b/codev/projects/1047-tower-terminals-architects-bui/status.yaml new file mode 100644 index 000000000..b0534da96 --- /dev/null +++ b/codev/projects/1047-tower-terminals-architects-bui/status.yaml @@ -0,0 +1,29 @@ +id: '1047' +title: tower-terminals-architects-bui +protocol: pir +phase: review +plan_phases: [] +current_plan_phase: null +gates: + plan-approval: + status: approved + requested_at: '2026-06-14T22:20:45.222Z' + approved_at: '2026-06-14T23:10:56.668Z' + dev-approval: + status: approved + requested_at: '2026-06-14T23:33:49.007Z' + approved_at: '2026-06-15T06:08:32.090Z' + pr: + status: pending + requested_at: '2026-06-15T06:25:13.811Z' +iteration: 1 +build_complete: true +history: [] +started_at: '2026-06-14T22:12:29.323Z' +updated_at: '2026-06-15T06:25:13.812Z' +pr_history: + - phase: review + pr_number: 1050 + branch: builder/pir-1047 + created_at: '2026-06-15T06:15:14.783Z' +pr_ready_for_human: true diff --git a/codev/resources/arch.md b/codev/resources/arch.md index b6e545104..07ff828f0 100644 --- a/codev/resources/arch.md +++ b/codev/resources/arch.md @@ -206,6 +206,10 @@ Shell / Claude / Builder process 5. **Kill**: Tower sends SIGTERM via SIGNAL frame, waits 5s, SIGKILL if needed. Cleans up socket file. 6. **Graceful degradation**: If shellper spawn fails, Tower falls back to direct node-pty (non-persistent). SQLite row has `shellper_socket = NULL`. Dashboard shows "Session persistence unavailable" warning. +#### Terminal reconnect/replay contract (#1047) + +When a WebSocket client attaches, Tower replays the ring buffer as a binary DATA frame **bracketed by `pause`/`resume` control frames** so the client can render the (potentially large) snapshot without counting it against its live-backpressure budget. Clients pass `?resume=` to request only the bytes after a sequence number (a *delta* reconnect) instead of the full buffer — the dominant cost saver when Tower is hosted remotely and reconnects are frequent. The ring-buffer `partial` (incomplete trailing line) is kept **whole and unbounded** so a full-screen TUI's alt-screen state replays faithfully; per-frame CPU is bounded instead by `pushData` scanning only the new chunk (not re-splitting the accumulated buffer). On the client side, a connecting terminal must **force a redraw** shortly after attach (a size-delta resize → SIGWINCH) — a full-screen TUI only repaints on a size *change*, and the connect-time resize can be a same-size no-op; both the web dashboard (`Terminal.tsx`) and the VSCode adapter (`terminal-adapter.ts`) do this. Under genuine *live* overload, clients **drop** ephemeral output (the app repaints) rather than reconnecting — reconnecting to relieve backpressure re-pulls the same payload and storms. + #### Startup Readiness Barrier (#997) Tower binds its port and starts serving immediately, but `reconcileTerminalSessions()` (which re-registers persistent sessions in the `workspaceTerminals` map) runs *after* `server.listen()`. To stop the first post-restart read from seeing a half-populated `role → terminalId` map, a monotonic settled-once barrier in `tower-terminals.ts` gates the readers of reconcile's output: diff --git a/codev/resources/lessons-learned.md b/codev/resources/lessons-learned.md index 82075f2f1..03371829d 100644 --- a/codev/resources/lessons-learned.md +++ b/codev/resources/lessons-learned.md @@ -424,6 +424,9 @@ Generalizable wisdom extracted from review documents, ordered by impact. Updated - [From 913] VSCode's `TreeView` exposes `reveal(…, {expand})` but **no per-item collapse** — and it replays its own remembered expand/collapse state for any `TreeItem.id` it has already seen, ignoring the provider's `collapsibleState` on a re-render. The only blunt collapse primitive, `workbench.actions.treeView..collapseAll`, is *tree-wide* (it flattens group headers too, not just the rows you meant). To force *specific* rows collapsed while leaving others alone, **version the id**: render each row as `#` and bump the version for the rows you want collapsed — a never-seen id makes VSCode fall back to the provider's default state. Keep the row you want left open at its *verbatim* current id (don't reconstruct it from a counter — store the literal string, so a render landing between a bump and the next interaction can't strand it on a stale version). The monotonic version is load-bearing: a re-opened row must never reuse an id VSCode remembers as expanded, or it pops back open. Encapsulate the three moving parts (version, open-id, open-builder) in one small named unit rather than scattering counters across the provider. - [From 913] Don't persist UI state whose lifetime is shorter than the thing it describes. The Builders tree's per-group expansion was saved to `workspaceState`, but active builders live hours and the set churns constantly — persisting "you collapsed VSCODE last Tuesday" across reloads served no one and actively masked an accordion bug (programmatic `collapseAll` wrote `false` for every group, indistinguishable from a user chevron-click). Dropping persistence entirely and defaulting to expanded each session is *less* code and the better UX; VSCode's native per-id in-session memory already keeps a user-collapsed group collapsed until reload, for free. Persistence still earns its keep in the sibling Backlog view (hundreds of long-lived items, users curate which areas matter) — the lesson is to match persistence to lifetime per-view, not to apply one mechanism uniformly. +- [From 1047] When the same symptom appears in one client but not another that shares the backend, the bug is in the client-specific layer — diff the two clients, don't debug the backend. Terminals froze/blanked in the VSCode extension but rendered fine in the web dashboard, both hitting the same Tower/WS/PTY/shellper/Claude. That single fact proved the app *was* painting and the bytes *did* exist, so the cause was the extension's connect/render path — not a missing SIGWINCH on the app side, which is where two failed hypotheses had been chasing it. Reading the web client (`Terminal.tsx`) revealed the missing piece: it unconditionally sends a "redraw" resize ~500ms after connect ("send SIGWINCH to make the shell redraw"), and even has a `skipReplay` mode that *just* sends SIGWINCH. The VSCode `Pseudoterminal` adapter had no equivalent, so a connect-time resize that matched the PTY's size was a no-op and the full-screen TUI never repainted. The fix mirrors the web: a gated post-connect nudge (`rows-1` then `rows`) guarantees a real `TIOCSWINSZ` delta. Corollary: a full-screen TUI in the alternate screen buffer only repaints on a size *change* — a same-size resize sends no SIGWINCH, so "re-assert the size" must be a real delta, not a re-send. +- [From 1047] A buffer bounded by newline count is unbounded for a newline-free stream. `RingBuffer.partial` and `ShellperReplayBuffer` both capped by *lines*; a full-screen TUI redraws in place with cursor-addressing and `\r` and emits almost no `\n`, so a single 15 MB session had zero newlines — the partial grew without limit and every `pushData` re-split the whole accumulated string (O(n²)). The decisive evidence was the on-disk PTY logs (`~/.agent-farm/logs/*.log`): measuring "longest run without `\n`" instantly distinguished the pathological session (whole file) from normal ones (~20 bytes/line). Fix the hot path by scanning only the *new* chunk; bound by bytes only where a real ceiling is needed. (Caution: a byte cap that front-trims the buffer corrupts TUI *replay* — the alt-screen state lives in the cumulative stream from the alt-screen-enter onward — so we kept the buffer whole and bounded CPU via scan-only instead.) +- [From 1047] A backpressure "relief valve" that re-fetches the overflowing payload is an infinite loop. The VSCode terminal's old rule was "receive buffer > 1 MB → disconnect and reconnect for a fresh replay" — but the oversized thing *was* the replay, so each reconnect re-delivered it: ~14,000 connect→overflow→reconnect cycles in under an hour (visible 1:1:1 in the client log), pegging one core and starving every terminal. For ephemeral live output, *drop* under overload (the app repaints); reserve reconnect for genuine transport loss, and never make the overload remedy re-pull the same bytes. - [From 787] When adding a field to a multi-forge concept contract (`pr-list`, `issue-list`, …), per-CLI data availability diverges — do not assume parity across `gh`/`glab`/`tea`. Verify each empirically (live output or, failing that, the CLI's own `--help` and official docs): `gh pr list --json` exposes `isDraft` + `reviewRequests` (reviewer objects → flatten `.login`, dropping teams); `glab mr list --output json` exposes `draft` + `reviewers[].username` (same command, just add jq); but `tea pulls list` exposes *neither* (its JSON is limited to the selectable `--fields`, which omit draft/reviewers — the data exists only via raw `tea api`). Populate where the existing command can; default safely (`[]`/`false`) where it can't, and document the verified reason in the script. Defaulting a field the CLI *does* expose silently drops working data; rewriting a script onto a different command (e.g. `tea api`) to reach an unavailable field is a separate, larger change. Make the new required fields safe end-to-end by defaulting at the server mapping (`?? []` / `?? false`) so a non-conforming forge degrades rather than emitting `undefined`. --- diff --git a/codev/reviews/1047-tower-terminals-architects-bui.md b/codev/reviews/1047-tower-terminals-architects-bui.md new file mode 100644 index 000000000..b0bd207ec --- /dev/null +++ b/codev/reviews/1047-tower-terminals-architects-bui.md @@ -0,0 +1,78 @@ +# PIR Review: Terminal freeze + blank-on-open (oversized-replay storm and missing connect-time redraw) + +Fixes #1047 + +## Summary + +Two client/server terminal bugs were fixed. (1) **Freeze:** a full-screen TUI (Claude's UI) redraws in place and emits almost no newlines, so Tower's reconnection buffers — bounded by *line count* — grew unbounded; the resulting multi-MB replay overflowed the VSCode client's 1 MB receive budget, and the client's "disconnect-and-reconnect for replay" response re-fetched the same oversized snapshot in a tight loop (~14k cycles/hour), pegging one CPU core and starving every terminal. (2) **Blank-on-open:** a freshly-attached terminal could render a blank pane until a manual window resize, because the connect-time resize could be a same-size no-op and a full-screen TUI only repaints on a size *change*. Tower now scans only new output (killing the O(n²) re-scan) and brackets replay with `pause`/`resume` so it is paced and excluded from backpressure; the client drops live output under overload instead of reconnecting, reconnects with a `?resume=` delta *for newline-bearing streams*, and forces a guaranteed redraw shortly after connect (mirroring the web dashboard). **Scope note:** the plan's byte-addressable seq and shellper-side byte cap (Fix B) were deliberately descoped during dev-approval — byte-trimming the buffer corrupts a full-screen TUI's replay (its alt-screen state lives in the cumulative stream), so `partial`/replay are kept whole. Consequence: for a pure no-newline TUI stream, a reconnect falls back to full (faithful) replay rather than a delta, and the shellper restart-replay is line-bounded only (unbounded for no-newline output — a memory trade-off the issue itself rated minor/orthogonal, now observable via the new partial-size monitor). The post-connect repaint nudge covers correctness in both cases. + +## Files Changed + +- `packages/codev/src/terminal/ring-buffer.ts` (+39 / -…) — `pushData` scans only the incoming chunk (O(|data|), not O(|partial|)); `partial` kept whole for faithful replay; `partialBytes` getter for observability. +- `packages/codev/src/terminal/pty-session.ts` (+14) — `attachShellper` idempotency (drop prior client listeners on re-attach); `partialBytes` passthrough. +- `packages/codev/src/terminal/pty-manager.ts` (+30 / -…) — `createSessionRaw` tears down a colliding session before overwrite; `inspectPartials()` for the monitor. +- `packages/codev/src/agent-farm/servers/tower-websocket.ts` (+8 / -…) — bracket replay with `pause`/`resume` control frames. +- `packages/codev/src/agent-farm/servers/tower-server.ts` (+32) — periodic ring-buffer-partial monitor (observability). +- `packages/vscode/src/terminal-adapter.ts` (+124 / -…) — replay excluded from backpressure; live overload **drops** instead of reconnecting; `?resume=` delta reconnect; post-connect repaint nudge. +- `packages/codev/src/terminal/__tests__/ring-buffer.test.ts` (+25), `packages/codev/src/agent-farm/__tests__/tower-websocket.test.ts` (+14 / -…), `packages/vscode/src/__tests__/terminal-adapter.test.ts` (+137 / -…) — unit coverage. +- `packages/vscode/CHANGELOG.md` (+7) — user-facing entries for both fixes. +- `codev/resources/arch.md` (+4), `codev/resources/lessons-learned.md` (+3) — see sections below. + +(`git diff --stat`: 12 files, +410 / -27, excluding plan/thread/status artifacts.) + +## Commits + +- `8868ac86` [PIR #1047] Fix A+E: byte-cap RingBuffer.partial (scan-only pushData) + reconnect listener hygiene +- `7d42dac9` [PIR #1047] Fix B: byte-cap ShellperReplayBuffer (bounds no-newline replay across restart) +- `f7e6b3a2` [PIR #1047] Fix C+D: pause/resume replay bracket, drop-not-reconnect on live overload, resume-delta reconnect +- `9f282745` [PIR #1047] Instrumentation: log terminal ring-buffer partial sizes on a Tower monitor interval +- `757a75a5` [PIR #1047] Update tower-websocket test for pause/resume replay bracket +- `2f5e6351` [PIR #1047] docs: user-facing CHANGELOG entry for terminal-freeze fix +- `b05b73f2` [PIR #1047] Hoist replay-cap resolution out of constructor args; document the three distinct size limits +- `02ff5546` [PIR #1047] Remove byte caps: keep partial/replay whole for faithful TUI replay +- `7100a5a3` [PIR #1047] Re-assert terminal size after replay (Option A) + temporary diag logging +- `81c4a83b` [PIR #1047] Force a post-connect redraw nudge (mirror web client's SIGWINCH-on-connect); revert Option A +- `6d994f72` [PIR #1047] docs: changelog entry for terminal blank-on-open fix +- `cc7fbdc7` [PIR #1047] Strip temporary [#1047-diag] diagnostic logging + +(Plus the review/test-coverage commit carrying this file.) + +## Test Results + +- `pnpm build`: ✓ pass +- `pnpm --filter @cluesmith/codev test`: ✓ 3308 passed / 0 failed / 48 skipped +- `packages/vscode` adapter suite: ✓ 20 passed (including 5 storm-prevention + 2 repaint-nudge tests new for #1047); full VSCode suite previously green at 414. +- Manual verification (human, `dev-approval` gate, macOS VSCode against the worktree Tower): confirmed via the temporary `[#1047-diag]` client log that there is no connect→backpressure→reconnect storm, `seq` heartbeats stay clean, and — after the nudge landed — terminals **paint on open without a manual resize**. + +## Architecture Updates + +**COLD (`codev/resources/arch.md`)** — added a "Terminal reconnect/replay contract (#1047)" subsection under Shellper Lifecycle documenting the now-load-bearing behaviors: replay bracketed by `pause`/`resume` (excluded from client backpressure), `?resume=` delta reconnect, the deliberately-unbounded `partial` (faithful TUI replay) with CPU bounded by scan-only `pushData`, the client's connect-time forced redraw, and drop-not-reconnect under live overload. + +No HOT (`arch-critical.md`) change: these are subsystem-level protocol details, not a top-tier always-injected system-shape fact, and the hot file is at its cap. + +## Lessons Learned Updates + +**COLD (`codev/resources/lessons-learned.md`, Debugging section)** — three `[From 1047]` entries: (1) cross-client differential diagnosis — a symptom in one client but not another sharing the backend localizes the bug to the client layer (web rendered fine → the app *was* painting → the VSCode adapter was at fault), and a full-screen TUI only repaints on a size *change*; (2) a buffer bounded by newline count is unbounded for a newline-free stream, with the on-disk PTY logs ("longest run without `\n`") as the decisive evidence, and the caution that front-trimming a buffer corrupts TUI replay; (3) a backpressure relief valve that re-fetches the overflowing payload is an infinite loop — drop ephemeral output, don't reconnect. + +No HOT (`lessons-critical.md`) change: these are debugging recipes, not a top-tier cross-cutting rule, and the hot file is at its cap. + +## Things to Look At During PR Review + +- **The dropped byte caps (history).** Early commits added byte caps to `RingBuffer.partial` and `ShellperReplayBuffer`; they were **reverted** (`02ff5546`) after dev-approval testing showed front-trimming corrupts a full-screen TUI's replay (its alt-screen state lives in the cumulative stream from the alt-screen-enter onward). The freeze is fixed by *scan-only* `pushData` + the replay bracket, not by bounding the buffer. Net effect: `partial`/replay are functionally back to `main`'s faithful-but-unbounded behavior; the new `tower-server.ts` monitor logs partial size so unbounded growth (a known, issue-rated-minor trade-off) is observable. +- **The repaint nudge mechanism** (`terminal-adapter.ts`, `scheduleRepaintNudge`). It is a `rows-1` → `rows` size delta 500ms after connect, **gated** on `renderedSinceConnect` so a reconnect that already painted via replay doesn't reflow. The brief 1-row intermediate frame is intentional (guarantees a real `TIOCSWINSZ` delta even when the PTY is already at the target size). Confirm the gating and that it's cleared on close/reconnect (`resetStreamState`). +- **`#737` and the nudge are kept separate** (deliberate): `#737`'s on-open resize sets the correct size immediately (no flicker, common case); the nudge is the delayed backstop for the same-size no-op case. They're complementary, not redundant. +- **CMAP disposition (single-pass — please scrutinize at this gate).** 3-way consult: Codex `REQUEST_CHANGES`, Claude `APPROVE`, Gemini failed (consult ran against an empty sandbox and returned a clarification prompt, not a review — disregarded). Codex's two substantive points are the *deliberately descoped* byte-addressable seq and shellper byte cap (Fix B): both were dropped during dev-approval because byte-trimming a buffer corrupts a full-screen TUI's faithful replay (the regression the human caught when terminals went blank) — re-adding them would reintroduce that regression, and the issue itself rated the resulting unbounded memory as minor/orthogonal. So those are **rebutted, not fixed** (full rationale in `codev/projects/1047-*/1047-review-iter1-rebuttals.md`). The one point both reviewers raised that *was* legitimate — the review/comments overstating the `?resume=` delta as universal — is **fixed**: the Summary now scopes it to newline-bearing streams, and `RingBuffer.getSince` carries a comment documenting the no-newline behavior + nudge mitigation. Claude's Fix E test gap is **closed** (`pty-session-attach.test.ts`). Because PIR will not re-review, the human at this gate is the final check on whether the seq/Fix-B descope is acceptable. + +## How to Test Locally + +- **View diff**: VSCode sidebar → right-click builder `pir-1047` → **Review Diff**. +- **Run dev server**: VSCode sidebar → **Run Dev Server**, or `afx dev pir-1047`. +- **What to verify**: + - Open an architect/builder terminal running Claude's full-screen UI and let it redraw past ~1 MB of no-newline output; Tower CPU stays low and there is **no** connect→backpressure→reconnect storm (contrast the original ~14k-cycle capture). + - Open a terminal that previously blanked — it **paints on open** without a manual window resize. + - Reconnect a tab (close/reopen) → clean repaint; the reconnect URL carries `?resume=` (delta). + - Optional: leave Tower running for hours on a real workload and confirm CPU stays flat (the true end-to-end soak; cannot fit a gate session). + +## Flaky Tests + +None skipped. (Earlier full-suite runs showed environmental failures in `session-manager`/`adopt`/`hot-tier` integration tests when `dist/` and the copied skeleton were absent; a full `pnpm build` resolves them — they are not related to this change and pass after build.) diff --git a/codev/state/pir-1047_thread.md b/codev/state/pir-1047_thread.md new file mode 100644 index 000000000..71bb36311 --- /dev/null +++ b/codev/state/pir-1047_thread.md @@ -0,0 +1,110 @@ +# PIR #1047 — Tower terminals freeze until restart + +## Phase: plan (in progress) + +### Investigation summary +- Symptom (from issue + 4 diagnostic comments): CPU climbs linearly with uptime (0.6% → 93% over ~10h), ALL PTY terminals freeze, restart fixes it, memory grows only +76MB (minor), FD growth glacial. SSE clients churn/marked-dead is a *secondary* symptom of CPU starvation (and SSE 5-min max-age eviction is by-design, not the bug). +- Leak-shape: CPU-bound work proportional to a growing in-memory collection, iterated on a per-data-frame or per-tick hot path. Regression window points at PIR #991 (terminal reconnect / successor-session resolution). +- Comments already ruled out: cron firing (bursty, bounded), the SSE pump itself (well-defended), GC-stall/memory mechanism (RSS minor). + +### Code findings (grounded) +- `PtySession.onPtyData` (pty-session.ts:251-281) is the per-data-frame hot path: pushes ring buffer, writes disk log, iterates `this.clients`, emits `'data'`. The shellper `client.on('data')` fan-out (attachShellper, pty-session.ts:142) feeds it. +- HAZARD 1: `createSessionRaw` (pty-manager.ts:126-163) does `this.sessions.set(id, session)` with NO teardown of a pre-existing entry under the same id. Its safety comment ("map is empty at reconcile time, can't collide") only holds at startup reconcile, NOT the on-the-fly reconnect path (tower-terminals.ts:923) which runs live. +- HAZARD 2: `attachShellper` (pty-session.ts:119-197) adds `data`/`exit`/`close` listeners to the shellper client with no prior-listener removal; nothing explicitly detaches the OLD PtySession from its OLD client on reconnect (relies on GC, which fails if the old client stays referenced/alive). +- On-the-fly reconnect (tower-terminals.ts:860-958) is guarded by `!session`, so it fires once per lost-session window (NOT every poll — the subagent over-claimed "every poll"). Exact trigger for repeated accumulation is not confirmable by static read alone. + +### Plan direction (initial draft — superseded) +First draft led with an EventEmitter listener-leak theory + instrumentation. Demoted after empirical work below. + +### ROOT CAUSE CONFIRMED (empirical, 2026-06-15) +Architect pushed back: do we have a real hypothesis, and is restart actually the fix? Investigated the live state + on-disk PTY logs. + +- Tower currently DOWN; 12 shellpers persist (by design). +- Measured `~/.agent-farm/logs/*.log` (raw PTY output). Incident-window log `f02bedcb…` (14 Jun 21:01, 15 MB) has **0 newlines** — longest no-newline run = the entire 14.57 MB. Byte census: 1.5M ESC, 164K CR, 0 LF → a Claude full-screen TUI redraw stream (`\e[?1049h \e[2J \e[H … \e[3G \r`). Normal sessions: ~20 bytes/line, longest line ~5-7KB. +- Root cause: two buffers bound by NEWLINE COUNT, never bound for a no-newline TUI stream: + 1. `RingBuffer.pushData` (ring-buffer.ts:36-51): `partial + data` then `split('\n')` every frame → O(partial) per frame, O(n²)/session, partial unbounded. No byte cap (only the stderr ring at session-manager.ts:67 has maxLineLength=10000). + 2. `ShellperReplayBuffer.append` (shellper-replay-buffer.ts:45,58): evicts only when lineCount>maxLines → 0 newlines → never evicts → unbounded; fed by shellper-process.ts:143. +- Explains ALL symptoms incl. the two the listener theory couldn't: CPU-without-memory, and **restart-unreliability** (shellper replay is also unbounded → restart re-seeds a 15 MB partial → O(n²) resumes unless the heavy session is gone). That is the answer to "not sure restart resolves it." + +### Fix (plan rewritten) +Primary: (1) RingBuffer.pushData scan only `data` + byte-cap `partial` (front-trim); (2) ShellperReplayBuffer byte cap. Secondary/optional: listener hygiene (attachShellper idempotent, createSessionRaw teardown). Plus targeted instrumentation (log partial/replay sizes on the 30s heartbeat). Accelerated unit tests compress the leak into CI; replay-correctness test guards no-regression. + +Plan rewritten in codev/plans/1047-tower-terminals-architects-bui.md — still at plan-approval gate, awaiting review. Two open questions for reviewer: cap sizes, and whether to include listener-hygiene here or split it out. + +### CLINCHER: client-side replay storm (2026-06-15, from VSCode log) +Architect supplied tmp/vscode-log.txt (42,143 lines, 58 min) + screenshot. Storm confirmed: +- 14,026 "WebSocket connected" / 14,015 "Backpressure exceeded 1MB" / 14,026 "Connecting to" (1:1:1). +- 14,017 reconnects target ONE terminal: f2dc55d1. Its disk log = 1.9 MB, 0 newlines → replay 1.9 MB > client MAX_QUEUE (1 MB). +- Client handleData (terminal-adapter.ts:300-308): queuedBytes += payload; if >1MB → reconnect(). reconnect() (281-296) does backoff.reset() → instant retry, no backoff. Tower re-sends the 1.9 MB replay (tower-websocket.ts:62-65 getAll() = [partial]) → infinite loop. +- This replay storm (Tower re-serializing multi-MB replay ~14k×/hr on the single event loop) is the DOMINANT CPU driver; O(n²) split is secondary. Only ~1 MB of no-newline output is needed to trigger. + +### Plan now has 3 coordinated fixes +A: RingBuffer.pushData scan-only-data + byte-cap partial (< MAX_QUEUE). B: ShellperReplayBuffer byte cap. C: client backpressure path must not infinite-loop (backoff + re-trip guard; ideally pause/resume bracket so replay isn't counted as live backpressure). Plus instrumentation. Still at plan-approval gate. + +### Scope decision (2026-06-15): "do all together" — remote-resilient full fix +Architect asked for the most resilient approach given Tower may be hosted locally OR remotely (cloud/tunnel). Folded ALL into one coordinated PR: +- A: RingBuffer scan-only + byte-cap partial + **byte-addressable seq/getSince** (so resume works for no-newline streams). +- B: ShellperReplayBuffer byte cap. +- C: client redesign — **delta reconnect (?resume=lastSeq)** + replay excluded from backpressure (pause/resume bracket) + **live overload DROPS not reconnects** (mirror Tower bufferedAmount drop) + keep backoff/give-up safety net. +- D: Tower brackets replay with pause/resume + honors resume via byte-aware getSince. +- E: listener hygiene folded in (attachShellper idempotent, createSessionRaw teardown) — matters more remotely (frequent reconnects). +- Caps env-configurable; Fix C removes the hard MAX_PARTIAL_BYTES { expect(session.attach).not.toHaveBeenCalled(); }); - it('sends replay data and seq frame on connect', () => { + it('brackets replay with pause/resume and sends a seq frame on connect (#1047)', () => { const ws = makeWs(); const session = makeSession(5); session.attach.mockReturnValue(['line1', 'line2']); @@ -136,8 +136,16 @@ describe('tower-websocket', () => { handleTerminalWebSocket(ws, session, req); - // Should send replay data + seq control frame - expect(ws.send).toHaveBeenCalledTimes(2); + // pause control + replay data + resume control + seq control = 4 frames. + // The pause/resume bracket lets the client exclude the (potentially large) + // replay snapshot from its live-backpressure budget (#1047). + expect(ws.send).toHaveBeenCalledTimes(4); + + const controlTypes = ws.send.mock.calls + .map((call: unknown[]) => call[0] as Buffer) + .filter((buf: Buffer) => buf[0] === 0x00) + .map((buf: Buffer) => JSON.parse(buf.subarray(1).toString('utf-8')).type); + expect(controlTypes).toEqual(['pause', 'resume', 'seq']); }); it('sends only seq frame when no replay lines', () => { diff --git a/packages/codev/src/agent-farm/servers/tower-server.ts b/packages/codev/src/agent-farm/servers/tower-server.ts index 808e97133..603113243 100644 --- a/packages/codev/src/agent-farm/servers/tower-server.ts +++ b/packages/codev/src/agent-farm/servers/tower-server.ts @@ -63,6 +63,15 @@ const rateLimitCleanupInterval = startRateLimitCleanup(); // Shellper session manager (initialized at startup) let shellperManager: SessionManager | null = null; let shellperCleanupInterval: NodeJS.Timeout | null = null; +let terminalPartialMonitorInterval: NodeJS.Timeout | null = null; + +// Observability for Issue #1047: the ring-buffer partial is kept whole (no +// byte cap) so reconnection replay stays faithful, which means a no-newline +// full-screen TUI grows it without bound. This monitor surfaces that growth so +// we can tell whether it's ever a real memory concern. Logged periodically; +// warned past this size. +const PARTIAL_WARN_BYTES = 4 * 1024 * 1024; +const TERMINAL_MONITOR_INTERVAL_MS = 60_000; // Parse arguments with Commander const program = new Command() @@ -154,6 +163,7 @@ async function gracefulShutdown(signal: string): Promise { // 4. Stop rate limit cleanup, shellper periodic cleanup, and SSE heartbeat clearInterval(rateLimitCleanupInterval); if (shellperCleanupInterval) clearInterval(shellperCleanupInterval); + if (terminalPartialMonitorInterval) clearInterval(terminalPartialMonitorInterval); clearInterval(sseHeartbeatInterval); // 4b. Flush and stop send buffer (Spec 403) — delivers any deferred messages @@ -391,6 +401,28 @@ server.listen(port, bindHost, async () => { getKnownWorkspacePaths, }); + // Issue #1047 observability: periodically report terminal ring-buffer + // partial sizes so a no-newline TUI stream (the freeze trigger) is visible + // in the Tower log. Cheap (O(sessions) once a minute) and .unref()'d so it + // never holds the process open. + terminalPartialMonitorInterval = setInterval(() => { + try { + const partials = getTerminalManager().inspectPartials(); + if (partials.length === 0) return; + let maxBytes = 0; + for (const p of partials) { + if (p.partialBytes > maxBytes) maxBytes = p.partialBytes; + if (p.partialBytes >= PARTIAL_WARN_BYTES) { + log('WARN', `Terminal ${p.id} (${p.label}) ring-buffer partial at ${Math.round(p.partialBytes / 1024)} KB — large no-newline TUI stream growing unbounded (#1047)`); + } + } + log('INFO', `Terminal partial monitor: ${partials.length} session(s), max partial ${Math.round(maxBytes / 1024)} KB`); + } catch (err) { + log('ERROR', `Terminal partial monitor failed: ${(err as Error).message}`); + } + }, TERMINAL_MONITOR_INTERVAL_MS); + terminalPartialMonitorInterval.unref(); + // Spec 403: Start send buffer for typing-aware message delivery startSendBuffer(log); diff --git a/packages/codev/src/agent-farm/servers/tower-websocket.ts b/packages/codev/src/agent-farm/servers/tower-websocket.ts index 1c57054cd..d6c6170cf 100644 --- a/packages/codev/src/agent-farm/servers/tower-websocket.ts +++ b/packages/codev/src/agent-farm/servers/tower-websocket.ts @@ -58,11 +58,17 @@ export function handleTerminalWebSocket(ws: WebSocket, session: PtySession, req: replayLines = session.attach(client); } - // Send replay data as binary data frame + // Send replay data as binary data frame, bracketed by pause/resume control + // frames (#1047). The bracket tells the client "this is the one-shot buffer + // snapshot" so it paces the write and excludes it from its live-backpressure + // budget. Without it, a client counts a large replay as live overload and + // (historically) looped forever reconnecting for the same oversized replay. if (replayLines.length > 0) { const replayData = replayLines.join('\n'); if (ws.readyState === WebSocket.OPEN) { + ws.send(encodeControl({ type: 'pause', payload: {} })); ws.send(encodeData(replayData)); + ws.send(encodeControl({ type: 'resume', payload: {} })); } } diff --git a/packages/codev/src/terminal/__tests__/pty-session-attach.test.ts b/packages/codev/src/terminal/__tests__/pty-session-attach.test.ts new file mode 100644 index 000000000..ff6c6583d --- /dev/null +++ b/packages/codev/src/terminal/__tests__/pty-session-attach.test.ts @@ -0,0 +1,79 @@ +import { describe, it, expect } from 'vitest'; +import { EventEmitter } from 'node:events'; +import { PtySession, type PtySessionConfig } from '../pty-session.js'; +import type { IShellperClient } from '../shellper-client.js'; + +/** + * Fix E (#1047): attachShellper must be idempotent — re-attaching a session to + * a new shellper client must drop the listeners it installed on the previous + * client, so a reconnect can't accumulate duplicate `data` listeners that each + * re-run onPtyData for every PTY byte (the listener-leak the hardening targets). + */ + +function makeFakeClient(): IShellperClient { + const emitter = new EventEmitter() as unknown as IShellperClient & { _lastDataAt: number }; + // attachShellper reads client.lastDataAt and subscribes data/exit/close. + Object.defineProperty(emitter, 'lastDataAt', { get: () => Date.now() }); + return emitter; +} + +function makeSession(): PtySession { + const config: PtySessionConfig = { + id: 'sess-1', + command: '', + args: [], + cols: 80, + rows: 24, + cwd: '/tmp', + env: {}, + label: 'test', + logDir: '/tmp', + diskLogEnabled: false, // avoid touching the filesystem + }; + return new PtySession(config); +} + +describe('PtySession.attachShellper idempotency (#1047 Fix E)', () => { + it('removes listeners from the previous client on re-attach', () => { + const session = makeSession(); + const clientA = makeFakeClient(); + const clientB = makeFakeClient(); + + session.attachShellper(clientA, Buffer.alloc(0), 1234); + expect(clientA.listenerCount('data')).toBe(1); + + // Re-attach to a new client (a reconnect): A's listeners must be dropped. + session.attachShellper(clientB, Buffer.alloc(0), 5678); + expect(clientA.listenerCount('data')).toBe(0); + expect(clientA.listenerCount('exit')).toBe(0); + expect(clientA.listenerCount('close')).toBe(0); + expect(clientB.listenerCount('data')).toBe(1); + }); + + it('a data frame on the detached client no longer reaches the ring buffer', () => { + const session = makeSession(); + const clientA = makeFakeClient(); + const clientB = makeFakeClient(); + + session.attachShellper(clientA, Buffer.alloc(0), 1234); + session.attachShellper(clientB, Buffer.alloc(0), 5678); + + const seqBefore = session.ringBuffer.currentSeq; + clientA.emit('data', Buffer.from('stale\n', 'utf-8')); // from the old client + expect(session.ringBuffer.currentSeq).toBe(seqBefore); // ignored + + clientB.emit('data', Buffer.from('fresh\n', 'utf-8')); // from the live client + expect(session.ringBuffer.currentSeq).toBe(seqBefore + 1); + }); + + it('does not drop listeners when re-attaching the same client instance', () => { + const session = makeSession(); + const clientA = makeFakeClient(); + + session.attachShellper(clientA, Buffer.alloc(0), 1234); + session.attachShellper(clientA, Buffer.alloc(0), 1234); // same instance + // Guard only fires for a *different* client, so this is a no-op re-subscribe + // path; the session stays attached and functional. + expect(clientA.listenerCount('data')).toBeGreaterThanOrEqual(1); + }); +}); diff --git a/packages/codev/src/terminal/__tests__/ring-buffer.test.ts b/packages/codev/src/terminal/__tests__/ring-buffer.test.ts index 7bbbd269c..d2e226a6d 100644 --- a/packages/codev/src/terminal/__tests__/ring-buffer.test.ts +++ b/packages/codev/src/terminal/__tests__/ring-buffer.test.ts @@ -113,6 +113,31 @@ describe('RingBuffer', () => { expect(buf.getAll()).toEqual(['hello', '', 'world']); }); + it('keeps a no-newline stream whole for faithful replay (Issue #1047)', () => { + const buf = new RingBuffer(10); + // 100 KB with no newline, in 1 KB frames — mimics a full-screen TUI that + // redraws in place and never emits \n. The whole stream must be preserved + // (not truncated) so a reconnection replay can reconstruct the screen. + const frame = 'x'.repeat(1024); + for (let i = 0; i < 100; i++) { + buf.pushData(frame); + } + expect(buf.size).toBe(0); // no complete lines + const all = buf.getAll(); + expect(all).toHaveLength(1); + expect(all[0].length).toBe(100 * 1024); // full content retained, not capped + expect(buf.partialBytes).toBe(100 * 1024); + }); + + it('partialBytes reports the held incomplete-line size', () => { + const buf = new RingBuffer(10); + expect(buf.partialBytes).toBe(0); + buf.pushData('abc'); + expect(buf.partialBytes).toBe(3); + buf.pushData('def\n'); // completes the line, clears partial + expect(buf.partialBytes).toBe(0); + }); + it('clear resets content but keeps seq', () => { const buf = new RingBuffer(5); buf.push('a'); diff --git a/packages/codev/src/terminal/pty-manager.ts b/packages/codev/src/terminal/pty-manager.ts index dd649e9b3..dcfeeadb3 100644 --- a/packages/codev/src/terminal/pty-manager.ts +++ b/packages/codev/src/terminal/pty-manager.ts @@ -120,8 +120,12 @@ export class TerminalManager { * a fresh one. Reconnect-after-restart passes the persisted id so a terminal * keeps its identity across a Tower restart (#991): the client's WebSocket url * (`/ws/terminal/`) stays valid, so the existing reconnect machinery - * re-attaches transparently rather than hitting a dead id. The in-memory - * sessions map is empty at reconcile time, so reusing the id can't collide. + * re-attaches transparently rather than hitting a dead id. At startup reconcile + * the in-memory sessions map is empty so reuse can't collide; on the *live* + * on-the-fly reconnect path (callers guard on a missing session) it normally + * can't either, but if an entry already exists under this id we tear it down + * first (Issue #1047 Fix E) so a replaced session can't keep firing listeners + * on the surviving shellper client. */ createSessionRaw(opts: { label: string; cwd: string; id?: string }): PtySessionInfo { if (this.sessions.size >= this.config.maxSessions) { @@ -129,6 +133,13 @@ export class TerminalManager { } const id = opts.id ?? randomUUID(); + const existing = this.sessions.get(id); + if (existing) { + // Defensive teardown before overwrite: detach the old session from its + // shellper client so its data/exit/close listeners stop processing on + // the client we're about to re-attach to a fresh session. + existing.detachShellper(); + } const { cols, rows } = defaultSessionOptions(); const sessionConfig: PtySessionConfig = { id, @@ -167,6 +178,21 @@ export class TerminalManager { return Array.from(this.sessions.values()).map(s => s.info); } + /** + * Snapshot of each session's ring-buffer partial size and client count + * (Issue #1047 observability). The partial is unbounded (kept whole for + * faithful replay); a large, growing partial flags a no-newline full-screen + * TUI stream, so this surfaces memory growth if it ever becomes a concern. + */ + inspectPartials(): Array<{ id: string; label: string; partialBytes: number; clients: number }> { + return Array.from(this.sessions.values()).map(s => ({ + id: s.id, + label: s.label, + partialBytes: s.partialBytes, + clients: s.clientCount, + })); + } + /** Get a session by ID. */ getSession(id: string): PtySession | undefined { return this.sessions.get(id); diff --git a/packages/codev/src/terminal/pty-session.ts b/packages/codev/src/terminal/pty-session.ts index 27798c5bd..f282ee8d9 100644 --- a/packages/codev/src/terminal/pty-session.ts +++ b/packages/codev/src/terminal/pty-session.ts @@ -117,6 +117,15 @@ export class PtySession extends EventEmitter { * User input flows: WebSocket → write() → shellper. */ attachShellper(client: IShellperClient, replayData: Buffer, shellperPid: number, shellperSessionId?: string): void { + // Idempotent re-attach (Issue #1047 Fix E): if a previous client is still + // attached, drop our listeners on it before subscribing to the new one so + // a re-attach can't double the per-byte data fan-out (each leaked 'data' + // listener would re-run onPtyData for every PTY byte). + if (this.shellperClient && this.shellperClient !== client) { + this.shellperClient.removeAllListeners('data'); + this.shellperClient.removeAllListeners('exit'); + this.shellperClient.removeAllListeners('close'); + } this._shellperBacked = true; this.shellperClient = client; this.shellperPid = shellperPid; @@ -410,6 +419,11 @@ export class PtySession extends EventEmitter { return this.clients.size; } + /** Bytes held in the ring buffer's incomplete-line partial (observability, #1047). */ + get partialBytes(): number { + return this.ringBuffer.partialBytes; + } + /** Record that a user sent input to this session. */ recordUserInput(): void { this._lastInputAt = Date.now(); diff --git a/packages/codev/src/terminal/ring-buffer.ts b/packages/codev/src/terminal/ring-buffer.ts index 6b966a9de..f66c26b01 100644 --- a/packages/codev/src/terminal/ring-buffer.ts +++ b/packages/codev/src/terminal/ring-buffer.ts @@ -31,23 +31,33 @@ export class RingBuffer { * chunk boundaries: if data doesn't end with \n, the trailing fragment * is held and prepended to the next pushData call. * + * Scans only the incoming `data` for newlines (never re-splits the whole + * accumulated `partial + data`), so per-call work is O(|data|) rather than + * O(|partial|) — the O(n²) re-scan that pegged Tower's CPU on no-newline + * full-screen-TUI streams (Issue #1047). The `partial` is kept whole and + * unbounded so a reconnection replay faithfully reconstructs the screen: a + * TUI in the alternate screen buffer encodes its state in the cumulative + * byte stream from the alt-screen-enter onward, so truncating the front + * would corrupt the replay (the app won't repaint on a same-size reconnect). + * * Returns last sequence number. */ pushData(data: string): number { - const combined = this.partial + data; - const parts = combined.split('\n'); - - // Last element is either: - // - "" if data ended with \n (all lines complete) - // - non-empty if data ended mid-line (incomplete line) - // Either way, save it as the new partial. - this.partial = parts.pop()!; + let start = 0; + let nl = data.indexOf('\n'); + while (nl !== -1) { + // Complete line = held partial (if any) + this segment up to the newline. + this.push(this.partial + data.slice(start, nl)); + this.partial = ''; + start = nl + 1; + nl = data.indexOf('\n', start); + } - let lastSeq = this.seq; - for (const line of parts) { - lastSeq = this.push(line); + // Remainder has no newline — append to the partial (cons-string, O(|data|)). + if (start < data.length) { + this.partial += data.slice(start); } - return lastSeq; + return this.seq; } /** Get all stored lines in order, including any incomplete trailing line. */ @@ -62,7 +72,20 @@ export class RingBuffer { return result; } - /** Get lines starting from a given sequence number (for resume). */ + /** + * Get lines starting from a given sequence number (for resume). + * + * Note (#1047): `seq` advances only on completed (newline-terminated) lines. + * A full-screen TUI emits no newlines, so for such a session `seq` stays at + * whatever the last real line was and a client that is caught up to it gets + * `[]` here — the in-progress `partial` (the current screen) is NOT replayed + * on a delta resume. That gap is covered by the client's post-connect repaint + * nudge, which forces the app to redraw on (re)connect (see + * `terminal-adapter.ts`). True byte-granular resume for no-newline streams was + * considered and deliberately descoped (it would require a byte-addressable + * seq and breaks the existing line-based wire contract); the nudge makes it + * unnecessary for correctness. + */ getSince(sinceSeq: number): string[] { const linesAvailable = this.count; const oldestSeq = this.seq - linesAvailable + 1; @@ -90,6 +113,11 @@ export class RingBuffer { return this.count; } + /** Bytes held in the incomplete-line partial (observability, #1047). */ + get partialBytes(): number { + return this.partial.length; + } + /** Clear the buffer and release memory. */ clear(): void { this.buffer = []; diff --git a/packages/vscode/CHANGELOG.md b/packages/vscode/CHANGELOG.md index 3b8b7984a..301a9879d 100644 --- a/packages/vscode/CHANGELOG.md +++ b/packages/vscode/CHANGELOG.md @@ -1,5 +1,12 @@ # Change Log +## [Unreleased] + +### Bug fixes + +- **Terminals no longer freeze over time until you restart Tower.** After hours of use, architect and builder terminals could stop rendering output all at once, with a Tower restart the only (and unreliable) fix. The cause was a full-screen TUI (such as the Claude Code prompt) that redraws in place and emits very little newline-terminated output: Tower re-scanned its entire (ever-growing) reconnection buffer on every chunk of output, and the oversized snapshot it sent on connect blew past the terminal's 1 MB receive budget, so the terminal disconnected, reconnected, received the same oversized snapshot, and looped — thousands of times a minute — pegging Tower's CPU and starving every terminal. Tower now processes only each new chunk instead of re-scanning the whole buffer, the reconnection snapshot is delivered in a paced way that no longer counts against the live backpressure budget, a reconnect now asks Tower only for what it missed instead of re-downloading the whole buffer (important when Tower runs remotely), and genuine live overload now drops ephemeral output (the terminal repaints) instead of reconnecting. +- **Terminals no longer open to a blank pane until you resize the window.** Opening an architect or builder terminal could show an empty pane with just a cursor, staying blank until you nudged the window size by hand. A full-screen terminal program only repaints when the terminal size changes, and the size the extension sent on connect could match what the program already had — so it never redrew. The extension now forces a redraw shortly after connecting if the pane is still empty, the same way the web dashboard already does. + ## [3.1.9] - 2026-06-08 Lockstep republish with `@cluesmith/codev@3.1.9` to keep CLI and extension versions aligned. No extension changes from 3.1.8 — see [3.1.8] below for the cycle's substantive content. The npm-side hotfix that triggered v3.1.9 didn't affect the extension; the v3.1.8 marketplace publish is functionally identical to v3.1.9. diff --git a/packages/vscode/src/__tests__/terminal-adapter.test.ts b/packages/vscode/src/__tests__/terminal-adapter.test.ts index 50e057cf5..2885014ee 100644 --- a/packages/vscode/src/__tests__/terminal-adapter.test.ts +++ b/packages/vscode/src/__tests__/terminal-adapter.test.ts @@ -39,13 +39,14 @@ vi.mock('ws', () => { readyState = 0; binaryType = ''; closed = false; + sent: unknown[] = []; private handlers: Record void>> = {}; constructor(public url: string) { FakeWebSocket.instances.push(this); } on(event: string, cb: (...args: unknown[]) => void): this { (this.handlers[event] ||= []).push(cb); return this; } - send(): void {} + send(data: unknown): void { this.sent.push(data); } close(): void { this.closed = true; } /** Test helper: synchronously dispatch a lifecycle event to handlers. */ emit(event: string, ...args: unknown[]): void { @@ -65,9 +66,35 @@ vi.mock('@cluesmith/codev-core/escape-buffer', () => ({ vi.mock('@cluesmith/codev-types', () => ({ FRAME_CONTROL: 0x00, FRAME_DATA: 0x01 })); +const FRAME_CONTROL = 0x00; +const FRAME_DATA = 0x01; + +/** Deliver a binary DATA frame to the adapter's `message` handler. */ +function sendData(socket: { emit(e: string, ...a: unknown[]): void }, payload: Buffer): void { + socket.emit('message', Buffer.concat([Buffer.from([FRAME_DATA]), payload])); +} + +/** Deliver a binary CONTROL frame (JSON) to the adapter's `message` handler. */ +function sendControl( + socket: { emit(e: string, ...a: unknown[]): void }, + msg: { type: string; payload: Record }, +): void { + const json = Buffer.from(JSON.stringify(msg), 'utf-8'); + socket.emit('message', Buffer.concat([Buffer.from([FRAME_CONTROL]), json])); +} + +/** Extract the resize control frames a fake socket has sent (#1047). */ +function sentResizes(socket: { sent: unknown[] }): Array<{ cols: number; rows: number }> { + return (socket.sent as Buffer[]) + .filter((b) => b[0] === FRAME_CONTROL) + .map((b) => JSON.parse(b.subarray(1).toString('utf-8')) as { type: string; payload: { cols: number; rows: number } }) + .filter((m) => m.type === 'resize') + .map((m) => m.payload); +} + // Imports AFTER mocks are registered. const WebSocket = (await import('ws')).default as unknown as { - instances: Array<{ closed: boolean; emit(e: string, ...a: unknown[]): void }>; + instances: Array<{ closed: boolean; readyState: number; sent: unknown[]; emit(e: string, ...a: unknown[]): void }>; OPEN: number; }; const { CodevPseudoterminal, RECONNECT_LINK_TEXT } = await import('../terminal-adapter.js'); @@ -294,3 +321,109 @@ describe('PIR #1001 — reconnect notices overwrite in place and clear on succes expect(writes[0]).toContain(RECONNECT_LINK_TEXT); }); }); + +describe('PIR #1047 — oversized replay storm prevention', () => { + function reconnectTo(pty: unknown, url: string): void { + (pty as { reconnect(u?: string): void }).reconnect(url); + } + + it('renders a >1MB replay (pause-bracketed) without reconnecting', () => { + const { pty, writes } = makeAdapter(); + currentSocket().emit('open'); + const socketsBefore = WebSocket.instances.length; + writes.length = 0; + + // Tower brackets the buffer snapshot: pause → (big replay) → resume. + sendControl(currentSocket(), { type: 'pause', payload: {} }); + sendData(currentSocket(), Buffer.alloc(2 * 1024 * 1024, 0x41)); // 2 MB + sendControl(currentSocket(), { type: 'resume', payload: {} }); + + // No reconnect (the old bug looped here ~14k times), and content rendered. + expect(WebSocket.instances.length).toBe(socketsBefore); + expect(writes.length).toBeGreaterThan(0); + expect(writes.join('')).not.toContain(RECONNECT_LINK_TEXT); // did not give up + void pty; + }); + + it('drops live output over the queue limit instead of reconnecting', () => { + const { writes } = makeAdapter(); + currentSocket().emit('open'); + const socketsBefore = WebSocket.instances.length; + writes.length = 0; + + // Not bracketed → treated as live. The old code called reconnect() here. + sendData(currentSocket(), Buffer.alloc(2 * 1024 * 1024, 0x42)); // 2 MB + + expect(WebSocket.instances.length).toBe(socketsBefore); // dropped, no reconnect + }); + + it('still renders normal-sized live output', () => { + const { writes } = makeAdapter(); + currentSocket().emit('open'); + writes.length = 0; + + sendData(currentSocket(), Buffer.from('hello world', 'utf-8')); + expect(writes.join('')).toContain('hello world'); + }); + + it('reconnect requests a resume delta from the last seq', () => { + const { writes } = makeAdapter(); + currentSocket().emit('open'); + writes.length = 0; + + sendControl(currentSocket(), { type: 'seq', payload: { seq: 42 } }); + currentSocket().emit('close'); + vi.advanceTimersByTime(1000); // scheduled reconnect opens a new socket + + expect((currentSocket() as unknown as { url: string }).url).toContain('resume=42'); + }); + + it('a successor-session reconnect resets seq and does a full (no-resume) replay', () => { + const { pty } = makeAdapter(); + currentSocket().emit('open'); + sendControl(currentSocket(), { type: 'seq', payload: { seq: 42 } }); + + reconnectTo(pty, 'ws://localhost:4100/successor'); + + const url = (currentSocket() as unknown as { url: string }).url; + expect(url).toContain('/successor'); + expect(url).not.toContain('resume'); + }); +}); + +describe('PIR #1047 — post-connect repaint nudge', () => { + /** Open the adapter with known dimensions and an OPEN socket so sends record. */ + function makeOpenAdapter(cols: number, rows: number) { + const { pty, writes } = makeAdapter(); + (pty as unknown as { setDimensions(d: { columns: number; rows: number }): void }) + .setDimensions({ columns: cols, rows: rows }); + const socket = currentSocket(); + socket.readyState = WebSocket.OPEN; + socket.emit('open'); + return { pty, writes, socket }; + } + + it('forces a redraw nudge (rows-1 then rows) when the pane stays blank after connect', () => { + const { socket } = makeOpenAdapter(100, 40); + socket.sent.length = 0; // drop the on-open resize; isolate the nudge + + vi.advanceTimersByTime(500); + + // The nudge is a real size delta then back: 100x39 then 100x40. + expect(sentResizes(socket)).toEqual([ + { cols: 100, rows: 39 }, + { cols: 100, rows: 40 }, + ]); + }); + + it('skips the nudge when output already rendered during the settle window', () => { + const { socket } = makeOpenAdapter(100, 40); + socket.sent.length = 0; + + // Live output arrives before the settle delay → pane is no longer blank. + sendData(socket, Buffer.from('hello', 'utf-8')); + vi.advanceTimersByTime(500); + + expect(sentResizes(socket).some((r) => r.rows === 39)).toBe(false); + }); +}); diff --git a/packages/vscode/src/terminal-adapter.ts b/packages/vscode/src/terminal-adapter.ts index e4579a766..ead0b187d 100644 --- a/packages/vscode/src/terminal-adapter.ts +++ b/packages/vscode/src/terminal-adapter.ts @@ -5,7 +5,9 @@ import { EscapeBuffer } from '@cluesmith/codev-core/escape-buffer'; import { BackoffController, classifyUpgradeError } from '@cluesmith/codev-core/reconnect-policy'; const CHUNK_SIZE = 16384; // 16KB — chunk onDidWrite to avoid CPU spikes -const MAX_QUEUE = 1048576; // 1MB — disconnect if queue exceeds this +const MAX_QUEUE = 1048576; // 1MB — drop live output if the unrendered queue exceeds this +const DROP_WARN_INTERVAL_MS = 5000; // throttle backpressure-drop warnings (#1047) +const REPAINT_NUDGE_DELAY_MS = 500; // settle delay before forcing a redraw-SIGWINCH (#1047), matching the web client's 500ms // Reconnect backoff. The exponential curve (1000 * 2^attempt, capped at 30s), // the give-up threshold, and the session-unknown classifier all live in the @@ -46,8 +48,20 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { // until the user happens to manually resize the panel (Bugfix #737). private lastDimensions: { cols: number; rows: number } | null = null; private queuedBytes = 0; + private lastDropWarnAt = 0; private disposed = false; + // Repaint-nudge state (#1047). A freshly-attached terminal can stay blank: + // the app inside (e.g. Claude's full-screen TUI) only paints after a real + // window-size change (SIGWINCH), but the resize we send on connect can be a + // same-size no-op or land before the app reacts. The web dashboard client + // avoids this by unconditionally sending a "redraw" resize ~500ms after + // connect (Terminal.tsx flushInitialBuffer); the Pseudoterminal path here had + // no equivalent. We mirror it: after a settle delay, if nothing has rendered, + // nudge the size (a brief 1-row change, then back) to force the SIGWINCH. + private renderedSinceConnect = false; + private repaintNudgeTimer: ReturnType | null = null; + // Reconnect-loop state. The adapter owns reconnection end-to-end (#936) — // backoff scheduling, give-up after MAX_RECONNECT_ATTEMPTS, and a terminal // failure state that stops the loop until the user manually reconnects. The @@ -82,6 +96,7 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { close(): void { this.disposed = true; + this.clearRepaintNudge(); if (this.reconnectTimer) { clearTimeout(this.reconnectTimer); this.reconnectTimer = null; @@ -127,11 +142,27 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { // ── WebSocket ──────────────────────────────────────────────── + /** + * Build the effective WebSocket URL, requesting a delta replay via + * `?resume=` when we've already received data (#1047). This ships + * only the bytes produced during the disconnect window instead of the whole + * buffer on every reconnect — the dominant win when Tower is hosted remotely + * and reconnects are frequent. A fresh connect (lastSeq 0) gets a full + * (bounded) replay, as does a switch to a successor session whose lastSeq + * was reset. + */ + private connectUrl(): string { + if (this.lastSeq <= 0) { return this.wsUrl; } + const sep = this.wsUrl.includes('?') ? '&' : '?'; + return `${this.wsUrl}${sep}resume=${this.lastSeq}`; + } + private connect(): void { if (this.disposed) { return; } - this.log('INFO', `Connecting to ${this.wsUrl}`); - const socket = new WebSocket(this.wsUrl); + const url = this.connectUrl(); + this.log('INFO', `Connecting to ${url}`); + const socket = new WebSocket(url); this.ws = socket; this.ws.binaryType = 'arraybuffer'; @@ -157,6 +188,10 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { if (this.lastDimensions) { this.sendResize(this.lastDimensions.cols, this.lastDimensions.rows); } + // Force a redraw-SIGWINCH after the connection settles if nothing has + // rendered (#1047), mirroring the web dashboard's post-connect resize. + this.renderedSinceConnect = false; + this.scheduleRepaintNudge(); }); this.ws.on('message', (raw: ArrayBuffer) => { @@ -271,6 +306,14 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { private resetStreamState(): void { this.decoder = new TextDecoder('utf-8', { fatal: false }); this.escapeBuffer = new EscapeBuffer(); + // Clear replay/backpressure state so a connection that dropped mid-replay + // (no `resume` control seen) doesn't leave `replaying` stuck true and + // mis-route the next connection's live data (#1047). + this.replaying = false; + this.queuedBytes = 0; + // Drop any pending repaint nudge from a prior connection; the next WS open + // reschedules it. + this.clearRepaintNudge(); } /** @@ -280,7 +323,13 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { */ reconnect(wsUrl?: string): void { if (this.disposed) { return; } - if (wsUrl) { this.wsUrl = wsUrl; } + if (wsUrl && wsUrl !== this.wsUrl) { + // A successor session (#991) is a different terminal id, so our lastSeq + // doesn't apply — reset it so connectUrl() requests a full (bounded) + // replay against the new session rather than a stale delta (#1047). + this.wsUrl = wsUrl; + this.lastSeq = 0; + } if (this.reconnectTimer) { clearTimeout(this.reconnectTimer); this.reconnectTimer = null; @@ -298,12 +347,29 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { // ── Data handling ──────────────────────────────────────────── private handleData(payload: Buffer): void { - // Backpressure check + // Replay burst (bracketed by Tower's pause→…→resume, #1047): the initial + // buffer snapshot is expected to be large and is delivered once. Pace it + // through the chunked writer and do NOT count it toward the live + // backpressure budget — otherwise a multi-MB replay trips the guard before + // anything renders and the old reconnect-for-replay path looped forever. + if (this.replaying) { + const text = this.decoder.decode(payload, { stream: true }); + const safe = this.escapeBuffer.write(text); + if (safe.length > 0) { this.renderedSinceConnect = true; this.writeChunked(safe); } + return; + } + this.renderedSinceConnect = true; + + // Live overload: if rendered output falls far enough behind that the queue + // exceeds MAX_QUEUE, DROP this burst rather than reconnecting. Terminal + // output is ephemeral — the app repaints — so dropping is safe, and unlike + // a reconnect it does not re-download the whole buffer (costly remotely) + // or risk an infinite replay storm (#1047). Mirrors Tower's own + // bufferedAmount drop on the send side. this.queuedBytes += payload.length; if (this.queuedBytes > MAX_QUEUE) { - this.log('WARN', 'Backpressure exceeded 1MB — disconnecting for replay'); + this.warnDroppedThrottled(); this.queuedBytes = 0; - this.reconnect(); return; } @@ -320,6 +386,50 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { } } + /** + * Warn (at most once per interval) that live output was dropped under + * backpressure. Throttled so a sustained burst can't spam the log the way + * the old reconnect-storm did (#1047 produced ~14k lines in under an hour). + */ + private warnDroppedThrottled(): void { + const now = Date.now(); + if (now - this.lastDropWarnAt >= DROP_WARN_INTERVAL_MS) { + this.lastDropWarnAt = now; + this.log('WARN', 'Backpressure exceeded 1MB — dropping output (terminal will repaint)'); + } + } + + /** + * Schedule the post-connect repaint nudge (#1047). Fires once after a settle + * delay; if the pane is still blank (nothing rendered since connect), it + * forces a SIGWINCH so the app redraws — the equivalent of the web client's + * unconditional post-connect resize. Skipped when replay/live output has + * already painted, so a reconnect that rendered from the buffer doesn't + * reflow. + */ + private scheduleRepaintNudge(): void { + this.clearRepaintNudge(); + this.repaintNudgeTimer = setTimeout(() => { + this.repaintNudgeTimer = null; + if (this.disposed || this.renderedSinceConnect) { return; } + if (!this.lastDimensions) { return; } + const { cols, rows } = this.lastDimensions; + if (rows <= 1) { this.sendResize(cols, rows); return; } + // A 1-row change then back guarantees a real TIOCSWINSZ delta (and thus a + // SIGWINCH) even if the PTY is already at the target size, ending at the + // correct dimensions. The brief intermediate frame is overwritten at once. + this.sendResize(cols, rows - 1); + this.sendResize(cols, rows); + }, REPAINT_NUDGE_DELAY_MS); + } + + private clearRepaintNudge(): void { + if (this.repaintNudgeTimer) { + clearTimeout(this.repaintNudgeTimer); + this.repaintNudgeTimer = null; + } + } + private writeChunked(text: string): void { let offset = 0; const writeNext = (): void => {