From e7e610a5ccd126357a1766ff72f8befc3e4900cb Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 08:12:29 +1000 Subject: [PATCH 01/31] chore(porch): 1047 init pir --- .../status.yaml | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) create mode 100644 codev/projects/1047-tower-terminals-architects-bui/status.yaml diff --git a/codev/projects/1047-tower-terminals-architects-bui/status.yaml b/codev/projects/1047-tower-terminals-architects-bui/status.yaml new file mode 100644 index 000000000..a05b64399 --- /dev/null +++ b/codev/projects/1047-tower-terminals-architects-bui/status.yaml @@ -0,0 +1,18 @@ +id: '1047' +title: tower-terminals-architects-bui +protocol: pir +phase: plan +plan_phases: [] +current_plan_phase: null +gates: + plan-approval: + status: pending + dev-approval: + status: pending + pr: + status: pending +iteration: 1 +build_complete: false +history: [] +started_at: '2026-06-14T22:12:29.323Z' +updated_at: '2026-06-14T22:12:29.323Z' From a38d6fb4d3368a4a17a7497f9a392261a8f7fd2c Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 08:20:39 +1000 Subject: [PATCH 02/31] [PIR #1047] Plan draft --- .../1047-tower-terminals-architects-bui.md | 84 +++++++++++++++++++ codev/state/pir-1047_thread.md | 19 +++++ 2 files changed, 103 insertions(+) create mode 100644 codev/plans/1047-tower-terminals-architects-bui.md create mode 100644 codev/state/pir-1047_thread.md diff --git a/codev/plans/1047-tower-terminals-architects-bui.md b/codev/plans/1047-tower-terminals-architects-bui.md new file mode 100644 index 000000000..0e7475382 --- /dev/null +++ b/codev/plans/1047-tower-terminals-architects-bui.md @@ -0,0 +1,84 @@ +# PIR Plan: Tower terminal freeze (CPU-bound listener/session accumulation) + +> Issue #1047. Tower terminals (architects + builders) become non-responsive until `afx tower stop && afx tower start`. + +## Understanding + +Over ~10 hours of Tower uptime, Tower's CPU climbs roughly linearly (0.6% at t=3m to 93% at t=10h) and every PTY terminal stops rendering output at once. A default (non-force) `afx tower stop && afx tower start` restores responsiveness; force-kill of shellpers is not involved. The diagnostic comments establish: + +- The event loop is alive (SSE heartbeats fire on schedule every 30s), so this is not a deadlock. CPU saturation is starving message delivery, which is why clients eventually fail keepalive and terminals look frozen. +- Memory grows only +76 MB over 10h and FDs grow by ~9, both minor. This rules out the GC-stall / memory-bloat mechanism. The load-bearing signal is **CPU**. +- The cron firing path and the SSE pump itself were already ruled out by the reporter (cron is bursty/bounded; the SSE pump at `tower-server.ts:192-262` is well-defended). +- The SSE-client churn visible in the logs is partly **by design**: the heartbeat evicts SSE connections older than `SSE_MAX_AGE_MS` (5 min) so clients reconnect (`tower-server.ts:234-262`). That churn is a symptom amplifier, not the root cause. + +The shape that fits all of this is **a per-data-frame or per-tick hot path doing work proportional to a growing in-memory collection** (an EventEmitter listener set, or a list/map iterated on every PTY byte burst). The regression window points at PIR #991 (terminal reconnect / successor-session resolution), whose reconnect cycles are the classic listener-accumulation surface. + +### What the code review found (grounded, but not yet runtime-confirmed) + +The hot path per PTY byte burst is `PtySession.onPtyData` (`packages/codev/src/terminal/pty-session.ts:251-281`): it pushes to the ring buffer, writes the disk log, iterates `this.clients`, and emits `'data'`. It is fed by the shellper `client.on('data', ...)` subscription installed in `attachShellper` (`pty-session.ts:142`). Two latent hazards make this path able to inflate with uptime: + +1. **`createSessionRaw` overwrites without teardown.** `PtyManager.createSessionRaw` (`packages/codev/src/terminal/pty-manager.ts:126-163`) ends with `this.sessions.set(id, session)` and never tears down a pre-existing entry under the same `id`. Its inline comment argues "the in-memory sessions map is empty at reconcile time, so reusing the id can't collide" — but that reasoning holds only for the **startup** reconcile, not for the **on-the-fly** reconnect path (`tower-terminals.ts:923`) which runs while Tower is live and the map is populated. + +2. **`attachShellper` adds listeners with no prior-listener removal.** `attachShellper` (`pty-session.ts:119-197`) installs `data`, `exit`, and `close` listeners on the shellper client and, in the `restartOnExit` branch, additional `data` listeners (`pty-session.ts:163-179`). `detachShellper` (`pty-session.ts:228-234`) does `removeAllListeners()`, but it is only invoked on Tower shutdown — not when an on-the-fly reconnect replaces a PtySession. If an **old** PtySession is dropped from the manager map while its old shellper client (or any closure capturing it) stays referenced, the old `data` listener keeps firing `onPtyData`, doubling the work per byte. Repeat across reconnect cycles and the per-byte cost grows. + +The on-the-fly reconnect path (`tower-terminals.ts:860-958`) is guarded by `!session`, so under normal operation it fires once per lost-session window — not on every poll. **This means static analysis cannot prove this is the production trigger.** The exact accumulation cadence only manifests after hours of real use, so the honest position is: this is the highest-confidence *candidate*, and the plan must both (a) harden it safely and (b) instrument Tower so the next occurrence is captured with hard data rather than inferred. + +## Proposed Change + +A three-part change, investigation-first so we confirm the mechanism rather than guess, paired with safe hardening and a fast regression test that compresses the 10-hour leak into CI. + +### Part A — Bounded, always-on instrumentation (confirm the mechanism) + +Add a cheap periodic self-report, hosted on the existing 30s SSE heartbeat interval (`tower-server.ts:236`) so it adds no new timer. Each tick logs a single structured line capturing the collections most likely to grow: + +- Total live `PtySession` count and total live `ShellperClient` count. +- Per-session `data` / `exit` / `close` listener counts on the shellper client (via `emitter.listenerCount(event)`), and the session's WebSocket `clientCount`. +- Total terminal WebSocket connection count and total SSE client count. +- Tower's own CPU since the last tick via `process.cpuUsage()` delta, normalized to a percentage. + +When any per-client listener count crosses a small threshold (for example > 5), emit a `WARN` naming the offending terminal id. This is O(number of sessions) once per 30s — negligible — and it directly confirms or refutes the listener-accumulation hypothesis and pins *which* collection grows. The reviewer can watch these lines live at the `dev-approval` gate, and they remain in production logs to catch the next real occurrence if the hardening below turns out incomplete. + +### Part B — Defensive hardening (safe regardless of whether it is THE cause) + +1. **Make PtySession replacement under a reused id non-leaking.** Before `createSessionRaw` (and the on-the-fly reconnect path) installs a new PtySession under an existing id, tear down any prior PtySession bound to that id: detach it from its shellper client (`removeAllListeners` on the old client) and clear its WebSocket clients/timers. Concretely, give `PtyManager` a guarded replace that calls the existing teardown (`detachShellper` / `cleanup`) on a colliding entry instead of silently overwriting the map. + +2. **Make `attachShellper` idempotent.** If `attachShellper` is called when a previous shellper client is still attached (or re-called on the same session), remove the previously-installed `data` / `exit` / `close` listeners from the old client first, so a re-attach cannot double the per-byte fan-out. + +3. **Add a `setMaxListeners` tripwire** on the shellper client (a small bound, for example 12) so any future accumulation surfaces as Node's native "possible EventEmitter memory leak" warning in the Tower log rather than silently degrading CPU. + +These three are correct in their own right (a replaced/re-attached session should never keep firing old listeners), so they are safe to land even if the production trigger turns out to be a different collection — in which case Part A's instrumentation tells us where to look next. + +### Part C — Accelerated repro test (capture data, not speculation) + +Add a unit/integration test that drives many reconnect/re-attach cycles against a fake in-memory shellper client and asserts that, after N cycles, (a) the shellper client's `data`/`exit`/`close` listener counts stay bounded, (b) the live `PtySession` count stays bounded, and (c) a single emitted data frame is processed exactly once (not N times). This compresses the 10-hour leak into a deterministic test and is what the `dev-approval` reviewer and CI run to verify the fix without waiting 10 hours. + +## Files to Change + +- `packages/codev/src/terminal/pty-session.ts` — make `attachShellper` idempotent (remove prior client listeners before re-subscribing); add `setMaxListeners` tripwire on the shellper client. (`attachShellper` ~119-197; `detachShellper` ~228-234.) +- `packages/codev/src/terminal/pty-manager.ts:126-163` — `createSessionRaw`: tear down any pre-existing PtySession under the same id before `this.sessions.set(...)`; update the now-inaccurate "can't collide" comment. +- `packages/codev/src/agent-farm/servers/tower-terminals.ts:860-958` — on-the-fly reconnect: ensure the replaced session is explicitly detached/torn down (belt-and-suspenders with the PtyManager change above). +- `packages/codev/src/agent-farm/servers/tower-server.ts:236-262` — add the bounded per-tick instrumentation onto the existing SSE heartbeat (or a sibling `.unref()`'d interval if cleaner), plus a small CPU-delta helper. +- `packages/codev/src/terminal/__tests__/` (and/or `packages/codev/src/agent-farm/__tests__/`) — new accelerated reconnect/listener-bound regression test. + +Final file list to be confirmed during implementation; the instrumentation may warrant a tiny helper module rather than inlining into `tower-server.ts`. + +## Risks & Alternatives Considered + +- **Risk: the hardening does not fix the production trigger** (the real growing collection is elsewhere). Mitigation: Part A's instrumentation is the safety net — it ships regardless and pins the true collection on the next occurrence. The hardening is independently correct, so it does no harm. +- **Risk: tearing down a replaced session detaches a still-needed live client.** Mitigation: only tear down when the id genuinely collides with a *different* PtySession instance, and route through the existing `detachShellper` (which already exists for the shutdown path) so semantics match a known-safe teardown. The accelerated test asserts a live data frame still reaches exactly one consumer after re-attach. +- **Risk: `setMaxListeners(12)` warns on a legitimately high fan-out.** Mitigation: the bound is a tripwire for logs only (Node warns, does not throw); tune if a legitimate case exceeds it. +- **Alternative considered — pure "instrument and wait" (no hardening).** Rejected: the listener/teardown gaps are real latent bugs worth fixing now, and shipping the fix plus the instrument is strictly better than shipping the instrument alone. +- **Alternative considered — bisect v3.1.5..v3.1.7 to pin the regression PR first.** Rejected as the *primary* path because it needs a 12h+ soak per candidate and does not produce a fix; the instrumentation captures the same signal faster and the hardening lands the fix. Bisect remains available as a fallback if Part A's data points away from the reconnect surface. + +## Test Plan + +The reviewer exercises this at the `dev-approval` gate against the running worktree: + +- **Unit/integration (fast, deterministic):** run the new accelerated reconnect test — `pnpm --filter @cluesmith/codev test` from the worktree. It must show listener and session counts staying bounded across many cycles, and a single data frame processed exactly once. +- **Full suite:** `pnpm --filter @cluesmith/codev test` (and `pnpm --filter @cluesmith/codev build`) green, run from the worktree (not the main checkout). +- **Manual / live observability:** start the worktree's Tower (`afx dev ` or a local Tower start against this branch), open a couple of architect/builder terminals, and confirm the new per-tick instrumentation line appears every 30s in `afx tower log -f` with sane, *stable* counts (listener counts staying at their baseline, not climbing) as terminals are opened, closed, and reconnected. Force a few terminal WS reconnects (close/reopen the VSCode tab) and confirm per-session listener counts return to baseline rather than incrementing. +- **Soak (optional, post-merge / verify):** leave Tower running for several hours on a real workload and confirm CPU stays low and the instrumentation counts stay flat. This is the true end-to-end confirmation; it is noted here for completeness but is not gating for the PR since it cannot fit a gate session. + +## Open Question for the Reviewer + +If you want this scoped more tightly, the two natural cut points are: (1) ship **only** Part A + Part C now (instrument + test, no behavior change) and use the captured data to target a follow-up fix, or (2) ship all three parts as proposed. I recommend (2) because the hardening is independently correct and low-risk, but the choice is yours at the gate. diff --git a/codev/state/pir-1047_thread.md b/codev/state/pir-1047_thread.md new file mode 100644 index 000000000..ce94e84da --- /dev/null +++ b/codev/state/pir-1047_thread.md @@ -0,0 +1,19 @@ +# PIR #1047 — Tower terminals freeze until restart + +## Phase: plan (in progress) + +### Investigation summary +- Symptom (from issue + 4 diagnostic comments): CPU climbs linearly with uptime (0.6% → 93% over ~10h), ALL PTY terminals freeze, restart fixes it, memory grows only +76MB (minor), FD growth glacial. SSE clients churn/marked-dead is a *secondary* symptom of CPU starvation (and SSE 5-min max-age eviction is by-design, not the bug). +- Leak-shape: CPU-bound work proportional to a growing in-memory collection, iterated on a per-data-frame or per-tick hot path. Regression window points at PIR #991 (terminal reconnect / successor-session resolution). +- Comments already ruled out: cron firing (bursty, bounded), the SSE pump itself (well-defended), GC-stall/memory mechanism (RSS minor). + +### Code findings (grounded) +- `PtySession.onPtyData` (pty-session.ts:251-281) is the per-data-frame hot path: pushes ring buffer, writes disk log, iterates `this.clients`, emits `'data'`. The shellper `client.on('data')` fan-out (attachShellper, pty-session.ts:142) feeds it. +- HAZARD 1: `createSessionRaw` (pty-manager.ts:126-163) does `this.sessions.set(id, session)` with NO teardown of a pre-existing entry under the same id. Its safety comment ("map is empty at reconcile time, can't collide") only holds at startup reconcile, NOT the on-the-fly reconnect path (tower-terminals.ts:923) which runs live. +- HAZARD 2: `attachShellper` (pty-session.ts:119-197) adds `data`/`exit`/`close` listeners to the shellper client with no prior-listener removal; nothing explicitly detaches the OLD PtySession from its OLD client on reconnect (relies on GC, which fails if the old client stays referenced/alive). +- On-the-fly reconnect (tower-terminals.ts:860-958) is guarded by `!session`, so it fires once per lost-session window (NOT every poll — the subagent over-claimed "every poll"). Exact trigger for repeated accumulation is not confirmable by static read alone. + +### Plan direction +Investigation-first + defensive hardening + accelerated repro test + bounded always-on instrumentation. Root cause is not confidently pinned statically (only manifests after ~10h), so the plan instruments to confirm on next occurrence AND lands safe hardening + a fast test that compresses the 10h leak into CI. + +Plan written to codev/plans/1047-tower-terminals-architects-bui.md — awaiting plan-approval gate. From a23b89c4fe3192bce1171c194e46c482b3640ebd Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 08:20:45 +1000 Subject: [PATCH 03/31] chore(porch): 1047 plan-approval gate-requested --- codev/projects/1047-tower-terminals-architects-bui/status.yaml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/codev/projects/1047-tower-terminals-architects-bui/status.yaml b/codev/projects/1047-tower-terminals-architects-bui/status.yaml index a05b64399..773bb42aa 100644 --- a/codev/projects/1047-tower-terminals-architects-bui/status.yaml +++ b/codev/projects/1047-tower-terminals-architects-bui/status.yaml @@ -7,6 +7,7 @@ current_plan_phase: null gates: plan-approval: status: pending + requested_at: '2026-06-14T22:20:45.222Z' dev-approval: status: pending pr: @@ -15,4 +16,4 @@ iteration: 1 build_complete: false history: [] started_at: '2026-06-14T22:12:29.323Z' -updated_at: '2026-06-14T22:12:29.323Z' +updated_at: '2026-06-14T22:20:45.223Z' From 64dfbbc74c09a9706b6a438562dec4fd1f4bf438 Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 08:49:43 +1000 Subject: [PATCH 04/31] [PIR #1047] Plan revised: confirmed root cause (unbounded no-newline buffers) --- .../1047-tower-terminals-architects-bui.md | 119 +++++++++++------- codev/state/pir-1047_thread.md | 19 ++- 2 files changed, 87 insertions(+), 51 deletions(-) diff --git a/codev/plans/1047-tower-terminals-architects-bui.md b/codev/plans/1047-tower-terminals-architects-bui.md index 0e7475382..6155cdeb3 100644 --- a/codev/plans/1047-tower-terminals-architects-bui.md +++ b/codev/plans/1047-tower-terminals-architects-bui.md @@ -1,84 +1,107 @@ -# PIR Plan: Tower terminal freeze (CPU-bound listener/session accumulation) +# PIR Plan: Tower terminal freeze — unbounded no-newline buffers (O(n²) output pump) -> Issue #1047. Tower terminals (architects + builders) become non-responsive until `afx tower stop && afx tower start`. +> Issue #1047. Tower terminals (architects + builders) become non-responsive over time; the only known recovery is a Tower restart, and even that is **not reliably** effective. ## Understanding -Over ~10 hours of Tower uptime, Tower's CPU climbs roughly linearly (0.6% at t=3m to 93% at t=10h) and every PTY terminal stops rendering output at once. A default (non-force) `afx tower stop && afx tower start` restores responsiveness; force-kill of shellpers is not involved. The diagnostic comments establish: +### Root cause (empirically confirmed) -- The event loop is alive (SSE heartbeats fire on schedule every 30s), so this is not a deadlock. CPU saturation is starving message delivery, which is why clients eventually fail keepalive and terminals look frozen. -- Memory grows only +76 MB over 10h and FDs grow by ~9, both minor. This rules out the GC-stall / memory-bloat mechanism. The load-bearing signal is **CPU**. -- The cron firing path and the SSE pump itself were already ruled out by the reporter (cron is bursty/bounded; the SSE pump at `tower-server.ts:192-262` is well-defended). -- The SSE-client churn visible in the logs is partly **by design**: the heartbeat evicts SSE connections older than `SSE_MAX_AGE_MS` (5 min) so clients reconnect (`tower-server.ts:234-262`). That churn is a symptom amplifier, not the root cause. +Tower's per-PTY-frame hot path re-scans an **unbounded, newline-delimited buffer** on every byte burst. A full-screen TUI (Claude Code's prompt/UI) runs in the alternate screen buffer and redraws **in place** using cursor-addressing and carriage returns, emitting almost **no `\n` bytes**. Two buffers in the output path bound themselves by newline count and therefore never bound at all for such a stream: -The shape that fits all of this is **a per-data-frame or per-tick hot path doing work proportional to a growing in-memory collection** (an EventEmitter listener set, or a list/map iterated on every PTY byte burst). The regression window points at PIR #991 (terminal reconnect / successor-session resolution), whose reconnect cycles are the classic listener-accumulation surface. +1. **`RingBuffer.partial`** (`packages/codev/src/terminal/ring-buffer.ts:36-51`). `pushData` does `const combined = this.partial + data; combined.split('\n')` and saves the trailing fragment back into `this.partial`. With no `\n` in the stream, `partial` grows without limit, and **every frame re-concatenates and re-splits the entire accumulated buffer** — O(partial) per frame, O(n²) over the session. There is no byte cap on `partial` (confirmed: the only byte cap in the terminal layer is the *stderr* ring buffer's `maxLineLength = 10000` at `session-manager.ts:67` — the main RingBuffer never got one). +2. **`ShellperReplayBuffer`** (`packages/codev/src/terminal/shellper-replay-buffer.ts:45,58`) on the shellper side. It evicts only `while (this.lineCount > this.maxLines …)`. With zero newlines, `lineCount` stays 0, eviction never fires, and `chunks` grows unbounded. Every PTY chunk is appended here (`shellper-process.ts:143`). -### What the code review found (grounded, but not yet runtime-confirmed) +### Empirical evidence -The hot path per PTY byte burst is `PtySession.onPtyData` (`packages/codev/src/terminal/pty-session.ts:251-281`): it pushes to the ring buffer, writes the disk log, iterates `this.clients`, and emits `'data'`. It is fed by the shellper `client.on('data', ...)` subscription installed in `attachShellper` (`pty-session.ts:142`). Two latent hazards make this path able to inflate with uptime: +Measuring the on-disk PTY logs (`~/.agent-farm/logs/*.log`, raw terminal output) across real sessions: -1. **`createSessionRaw` overwrites without teardown.** `PtyManager.createSessionRaw` (`packages/codev/src/terminal/pty-manager.ts:126-163`) ends with `this.sessions.set(id, session)` and never tears down a pre-existing entry under the same `id`. Its inline comment argues "the in-memory sessions map is empty at reconcile time, so reusing the id can't collide" — but that reasoning holds only for the **startup** reconcile, not for the **on-the-fly** reconnect path (`tower-terminals.ts:923`) which runs while Tower is live and the map is populated. +| Session log | Size | Newlines | Longest run without `\n` | +|---|---|---|---| +| `f02bedcb…` (14 Jun 21:01, **incident window**) | 15 MB | **0** | **14.57 MB (entire file)** | +| `d8406afb…` | 24 MB | 1,215,838 | 5,295 bytes | +| `4d5523cb…` | 20 MB | 1,098,370 | 7,596 bytes | -2. **`attachShellper` adds listeners with no prior-listener removal.** `attachShellper` (`pty-session.ts:119-197`) installs `data`, `exit`, and `close` listeners on the shellper client and, in the `restartOnExit` branch, additional `data` listeners (`pty-session.ts:163-179`). `detachShellper` (`pty-session.ts:228-234`) does `removeAllListeners()`, but it is only invoked on Tower shutdown — not when an on-the-fly reconnect replaces a PtySession. If an **old** PtySession is dropped from the manager map while its old shellper client (or any closure capturing it) stays referenced, the old `data` listener keeps firing `onPtyData`, doubling the work per byte. Repeat across reconnect cycles and the per-byte cost grows. +The incident-window session emitted **15 MB with not a single newline**. Byte census of that file: 1,500,892 ESC (`0x1b`), 164,652 CR (`0x0d`), 0 LF — a control-heavy redraw stream (`\e[?1049h \e[2J \e[H \e[?25l … \e[3G \e[5G … \r`). For that session, Tower's `partial` grows to ~15 MB and every incoming frame re-splits ~15 MB. Other sessions (normal newline density, ~20 bytes/line) are unaffected — which is exactly why the bug is intermittent and session-dependent. -The on-the-fly reconnect path (`tower-terminals.ts:860-958`) is guarded by `!session`, so under normal operation it fires once per lost-session window — not on every poll. **This means static analysis cannot prove this is the production trigger.** The exact accumulation cadence only manifests after hours of real use, so the honest position is: this is the highest-confidence *candidate*, and the plan must both (a) harden it safely and (b) instrument Tower so the next occurrence is captured with hard data rather than inferred. +### Why this matches every observation -## Proposed Change +- **CPU climbs ~linearly with uptime, then ~93% (one core saturated).** `partial` length grows ∝ uptime for an actively-redrawing TUI; per-frame cost ∝ `partial`; frame rate roughly constant → CPU ∝ uptime. Node is single-threaded, so one saturating session pegs ~one core. (Note: the "linear" framing rests on two CPU samples; this mechanism is consistent with a linear climb, and also with a faster ramp once a heavy session starts — both fit.) +- **ALL terminals freeze at once.** The O(n²) re-scan runs on the single shared event loop. One heavy session starves every other terminal's I/O behind it. +- **Memory grows only modestly (+76 MB / 10h).** The cost is CPU (repeated scan + GC churn from re-allocating multi-MB strings each frame), not retained memory. The live `partial`s plus GC headroom account for the modest RSS rise. +- **Input still propagates while render-back is broken** (the issue's stray-`e` screenshot). Writing input is O(small) and occasionally slips through; the output pump's O(partial) re-scan dominates and stalls rendering. +- **Restart is NOT reliably effective.** This is the key point in your question. `ShellperReplayBuffer` is *also* unbounded for no-newline streams, so on restart the shellper replays the full ~15 MB, Tower seeds a fresh `partial` from it, and the O(n²) re-scan resumes almost immediately if the heavy session is still redrawing. Restart only "fixes" it when the offending session has gone quiet or ended. That session-dependence is why you're not sure the restart resolves it. -A three-part change, investigation-first so we confirm the mechanism rather than guess, paired with safe hardening and a fast regression test that compresses the 10-hour leak into CI. +### Pathway walk-through (where responsiveness can break) -### Part A — Bounded, always-on instrumentation (confirm the mechanism) +Output (render-back) path, which is the broken direction: -Add a cheap periodic self-report, hosted on the existing 30s SSE heartbeat interval (`tower-server.ts:236`) so it adds no new timer. Each tick logs a single structured line capturing the collections most likely to grow: +1. **Shellper:** `pty.onData` → `ShellperReplayBuffer.append(buf)` **[BUG #2: unbounded on no-newline]** → forwards the chunk over the unix socket to Tower (`shellper-process.ts:143-146`). +2. **Tower:** `ShellperClient` parser emits `'data'` → `PtySession.onPtyData` (`pty-session.ts:251-281`): `RingBuffer.pushData` **[BUG #1: O(n²) re-scan + unbounded]**, then a synchronous `fs.writeSync` disk-log per frame **[secondary event-loop blocker]**, then broadcast to WS clients (drops frames when `ws.bufferedAmount ≥ 1 MB` — a symptom under starvation, not a cause), then `emit('data')`. +3. **Client:** WS → VSCode extension terminal client → xterm.js render. -- Total live `PtySession` count and total live `ShellperClient` count. -- Per-session `data` / `exit` / `close` listener counts on the shellper client (via `emitter.listenerCount(event)`), and the session's WebSocket `clientCount`. -- Total terminal WebSocket connection count and total SSE client count. -- Tower's own CPU since the last tick via `process.cpuUsage()` delta, normalized to a percentage. +Input path (stays responsive): WS message → `session.write` → `shellperClient.write` → unix socket → shellper → PTY. Small, cheap; consistent with the observed "input gets through, output frozen." -When any per-client listener count crosses a small threshold (for example > 5), emit a `WARN` naming the offending terminal id. This is O(number of sessions) once per 30s — negligible — and it directly confirms or refutes the listener-accumulation hypothesis and pins *which* collection grows. The reviewer can watch these lines live at the `dev-approval` gate, and they remain in production logs to catch the next real occurrence if the hardening below turns out incomplete. +The dominant responsiveness failure is Bug #1 (the O(n²) output re-scan); Bug #2 makes restart unreliable and leaks shellper memory; the per-frame `writeSync` is a smaller additive event-loop cost. -### Part B — Defensive hardening (safe regardless of whether it is THE cause) +### Status of the earlier listener-leak hypothesis -1. **Make PtySession replacement under a reused id non-leaking.** Before `createSessionRaw` (and the on-the-fly reconnect path) installs a new PtySession under an existing id, tear down any prior PtySession bound to that id: detach it from its shellper client (`removeAllListeners` on the old client) and clear its WebSocket clients/timers. Concretely, give `PtyManager` a guarded replace that calls the existing teardown (`detachShellper` / `cleanup`) on a colliding entry instead of silently overwriting the map. +My first draft led with an EventEmitter listener-accumulation theory on the PIR #991 reconnect surface. After empirical measurement, that is **demoted to a secondary, defensive cleanup**: it does not explain the 15 MB zero-newline artifact, the CPU-without-memory profile, or the restart-unreliability, whereas the buffer mechanism explains all three. There remain real-but-minor listener-hygiene gaps (`createSessionRaw` overwrites `sessions` without teardown at `pty-manager.ts:161`; `attachShellper` re-subscribes without removing prior listeners) which are worth a low-risk guard but are not the headline. -2. **Make `attachShellper` idempotent.** If `attachShellper` is called when a previous shellper client is still attached (or re-called on the same session), remove the previously-installed `data` / `exit` / `close` listeners from the old client first, so a re-attach cannot double the per-byte fan-out. +## Proposed Change -3. **Add a `setMaxListeners` tripwire** on the shellper client (a small bound, for example 12) so any future accumulation surfaces as Node's native "possible EventEmitter memory leak" warning in the Tower log rather than silently degrading CPU. +### Fix 1 (primary) — stop re-scanning, and byte-cap `RingBuffer.partial` -These three are correct in their own right (a replaced/re-attached session should never keep firing old listeners), so they are safe to land even if the production trigger turns out to be a different collection — in which case Part A's instrumentation tells us where to look next. +In `RingBuffer.pushData` (`ring-buffer.ts`): -### Part C — Accelerated repro test (capture data, not speculation) +- **Scan only the incoming `data` for newlines**, not the whole `partial + data`. Walk newline indices in `data`, push completed lines (the current `partial` prefixes only the first), and keep the trailing remainder as the new `partial`. This makes per-frame work O(|data|) instead of O(|partial|) and is **behavior-preserving** for replay (same lines, same partial). +- **Cap `partial` to a max byte length** (e.g. a generous `MAX_PARTIAL_BYTES`, on the order of 256 KB–1 MB — large enough that real lines are never clipped, small enough to bound work/memory). When an unbroken run exceeds the cap, trim the front of `partial` to the last `MAX_PARTIAL_BYTES`. Front-trim (rather than injecting a synthetic `\n`) avoids corrupting a TUI replay with spurious line feeds; the slight loss of the alt-screen-enter prefix self-heals because reconnect triggers a full TUI repaint (resize). The exact cap and trim-vs-segment choice is finalized in implementation and validated at the `dev-approval` gate. -Add a unit/integration test that drives many reconnect/re-attach cycles against a fake in-memory shellper client and asserts that, after N cycles, (a) the shellper client's `data`/`exit`/`close` listener counts stay bounded, (b) the live `PtySession` count stays bounded, and (c) a single emitted data frame is processed exactly once (not N times). This compresses the 10-hour leak into a deterministic test and is what the `dev-approval` reviewer and CI run to verify the fix without waiting 10 hours. +### Fix 2 (primary) — byte-cap `ShellperReplayBuffer` -## Files to Change +In `ShellperReplayBuffer.append` (`shellper-replay-buffer.ts`): add a `maxBytes` cap alongside `maxLines` (it already tracks `totalBytes`). Evict oldest chunks while `totalBytes > maxBytes`, with a single-chunk front-trim edge case mirroring the existing line logic. This bounds shellper memory for no-newline streams and, crucially, bounds the REPLAY frame so a restart can no longer re-seed a multi-MB `partial` — making restart-as-recovery deterministic and keeping it cheap. + +### Fix 3 (defensive, optional) — listener hygiene + +Make `attachShellper` idempotent (remove prior `data`/`exit`/`close` listeners before re-subscribing) and have the on-the-fly reconnect / `createSessionRaw` path tear down any pre-existing PtySession under a reused id before replacing it. Correct in its own right; low risk. Include only if it does not bloat the change — otherwise spin out to a follow-up issue. -- `packages/codev/src/terminal/pty-session.ts` — make `attachShellper` idempotent (remove prior client listeners before re-subscribing); add `setMaxListeners` tripwire on the shellper client. (`attachShellper` ~119-197; `detachShellper` ~228-234.) -- `packages/codev/src/terminal/pty-manager.ts:126-163` — `createSessionRaw`: tear down any pre-existing PtySession under the same id before `this.sessions.set(...)`; update the now-inaccurate "can't collide" comment. -- `packages/codev/src/agent-farm/servers/tower-terminals.ts:860-958` — on-the-fly reconnect: ensure the replaced session is explicitly detached/torn down (belt-and-suspenders with the PtyManager change above). -- `packages/codev/src/agent-farm/servers/tower-server.ts:236-262` — add the bounded per-tick instrumentation onto the existing SSE heartbeat (or a sibling `.unref()`'d interval if cleaner), plus a small CPU-delta helper. -- `packages/codev/src/terminal/__tests__/` (and/or `packages/codev/src/agent-farm/__tests__/`) — new accelerated reconnect/listener-bound regression test. +### Instrumentation (targeted, cheap) + +On the existing 30s SSE heartbeat, log per-session `ringBuffer` partial length and shellper replay byte size, and `WARN` when a partial exceeds a threshold (naming the terminal id). This directly observes the now-known cause and confirms the fix holds in production. Far more targeted than the listener-count instrumentation from the first draft. + +### Explicitly out of scope + +- Converting the per-frame `fs.writeSync` disk log to async/batched (a smaller, separate optimization — note for a follow-up if profiling after Fix 1 still shows it). +- Any change to default `tower stop` shellper-survival behavior (#274 / #832 / #999 / #991). +- The cron `ReferenceError` (#1048) and the spawn-failure sibling (#1038). + +## Files to Change -Final file list to be confirmed during implementation; the instrumentation may warrant a tiny helper module rather than inlining into `tower-server.ts`. +- `packages/codev/src/terminal/ring-buffer.ts` — `pushData`: scan only `data`; byte-cap `partial`. Add `MAX_PARTIAL_BYTES`. +- `packages/codev/src/terminal/shellper-replay-buffer.ts` — add `maxBytes` cap + byte-based eviction; thread a sane default (and wire `replayBufferLines`/a new bytes option from `shellper-process.ts:97` if needed). +- `packages/codev/src/agent-farm/servers/tower-server.ts:236-262` — bounded per-tick partial/replay-size instrumentation on the SSE heartbeat. +- (optional, Fix 3) `packages/codev/src/terminal/pty-session.ts` (`attachShellper`) and `packages/codev/src/terminal/pty-manager.ts:161` (`createSessionRaw` teardown). +- Tests under `packages/codev/src/terminal/__tests__/` — see Test Plan. ## Risks & Alternatives Considered -- **Risk: the hardening does not fix the production trigger** (the real growing collection is elsewhere). Mitigation: Part A's instrumentation is the safety net — it ships regardless and pins the true collection on the next occurrence. The hardening is independently correct, so it does no harm. -- **Risk: tearing down a replaced session detaches a still-needed live client.** Mitigation: only tear down when the id genuinely collides with a *different* PtySession instance, and route through the existing `detachShellper` (which already exists for the shutdown path) so semantics match a known-safe teardown. The accelerated test asserts a live data frame still reaches exactly one consumer after re-attach. -- **Risk: `setMaxListeners(12)` warns on a legitimately high fan-out.** Mitigation: the bound is a tripwire for logs only (Node warns, does not throw); tune if a legitimate case exceeds it. -- **Alternative considered — pure "instrument and wait" (no hardening).** Rejected: the listener/teardown gaps are real latent bugs worth fixing now, and shipping the fix plus the instrument is strictly better than shipping the instrument alone. -- **Alternative considered — bisect v3.1.5..v3.1.7 to pin the regression PR first.** Rejected as the *primary* path because it needs a 12h+ soak per candidate and does not produce a fix; the instrumentation captures the same signal faster and the hardening lands the fix. Bisect remains available as a fallback if Part A's data points away from the reconnect surface. +- **Risk: byte-trimming `partial` corrupts replay for a no-newline TUI** (loses early escape state). Mitigation: generous cap so only pathological streams are trimmed; reconnect already drives a full repaint; the accelerated test asserts a *normal* newline stream replays byte-identically (no behavior change for the common case). The reviewer validates a live reconnect at the `dev-approval` gate. +- **Risk: the cap is too aggressive and clips legitimately long single lines** (e.g. a 200 KB JSON blob on one line). Mitigation: set the cap well above realistic single-line sizes; it only bounds *pathological* unbroken runs (megabytes). +- **Risk: there is a second, independent leak** (e.g. the listener gap) that this doesn't cover. Mitigation: the instrumentation ships regardless and will surface any residual growth; Fix 3 covers the known listener gap defensively. +- **Alternative — only add instrumentation and bisect v3.1.5..v3.1.7 first.** Rejected as primary: the disk-log evidence already pins the mechanism without a 12h+ soak, and bisect would land no fix. Bisect remains a fallback if post-fix soak still shows CPU growth. +- **Alternative — segment the over-cap partial as ring lines instead of front-trimming.** Rejected (default) because rejoining ring lines with `\n` injects spurious line feeds into a TUI replay; front-trim avoids that. Revisit if a continuation-aware ring model is wanted later. ## Test Plan -The reviewer exercises this at the `dev-approval` gate against the running worktree: +Run from the worktree (`pnpm --filter @cluesmith/codev …`), not the main checkout. -- **Unit/integration (fast, deterministic):** run the new accelerated reconnect test — `pnpm --filter @cluesmith/codev test` from the worktree. It must show listener and session counts staying bounded across many cycles, and a single data frame processed exactly once. -- **Full suite:** `pnpm --filter @cluesmith/codev test` (and `pnpm --filter @cluesmith/codev build`) green, run from the worktree (not the main checkout). -- **Manual / live observability:** start the worktree's Tower (`afx dev ` or a local Tower start against this branch), open a couple of architect/builder terminals, and confirm the new per-tick instrumentation line appears every 30s in `afx tower log -f` with sane, *stable* counts (listener counts staying at their baseline, not climbing) as terminals are opened, closed, and reconnected. Force a few terminal WS reconnects (close/reopen the VSCode tab) and confirm per-session listener counts return to baseline rather than incrementing. -- **Soak (optional, post-merge / verify):** leave Tower running for several hours on a real workload and confirm CPU stays low and the instrumentation counts stay flat. This is the true end-to-end confirmation; it is noted here for completeness but is not gating for the PR since it cannot fit a gate session. +- **Unit — CPU bound (the core fix):** feed a synthetic no-newline stream of M bytes across K frames into `RingBuffer.pushData`; assert per-frame work does not scale with accumulated size (e.g. assert `partial.length` stays ≤ cap, and a frame-cost proxy stays flat as the stream grows). Today this test would show partial growing to M and unbounded re-scan; after the fix it stays bounded. +- **Unit — replay correctness (no regression):** a normal newline-delimited stream produces byte-identical `getAll()` / `getSince()` output before and after the change. +- **Unit — shellper buffer:** feed a zero-newline stream exceeding `maxBytes`; assert `ShellperReplayBuffer.size` stays ≤ `maxBytes` and `getReplayData()` returns the bounded tail. +- **Build + full suite:** `pnpm --filter @cluesmith/codev build` and `… test` green from the worktree. +- **Manual / live (at `dev-approval`):** start Tower on this branch, open an architect terminal running Claude's full-screen UI, let it redraw for a while, and watch `afx tower log -f`: the new instrumentation should show the partial size **plateau at the cap** rather than climb, and Tower CPU should stay low. Reconnect the terminal (close/reopen the VSCode tab) and confirm the screen repaints correctly (replay still works). Optionally stop/start Tower against a still-busy session and confirm CPU does **not** immediately re-saturate (Fix 2 working). +- **Soak (post-merge / verify, non-gating):** leave Tower running several hours on a real workload; confirm CPU stays flat. This is the true end-to-end confirmation but cannot fit a gate session. -## Open Question for the Reviewer +## Open Questions for the Reviewer -If you want this scoped more tightly, the two natural cut points are: (1) ship **only** Part A + Part C now (instrument + test, no behavior change) and use the captured data to target a follow-up fix, or (2) ship all three parts as proposed. I recommend (2) because the hardening is independently correct and low-risk, but the choice is yours at the gate. +1. **Cap sizes:** proposed `MAX_PARTIAL_BYTES` ~256 KB–1 MB and `ShellperReplayBuffer maxBytes` ~ a few MB. Comfortable with these, or want them configurable via env? +2. **Fix 3 (listener hygiene):** include in this PR as a small defensive guard, or split to a follow-up issue to keep this change tightly scoped to the buffer fix? diff --git a/codev/state/pir-1047_thread.md b/codev/state/pir-1047_thread.md index ce94e84da..c076a4cd6 100644 --- a/codev/state/pir-1047_thread.md +++ b/codev/state/pir-1047_thread.md @@ -13,7 +13,20 @@ - HAZARD 2: `attachShellper` (pty-session.ts:119-197) adds `data`/`exit`/`close` listeners to the shellper client with no prior-listener removal; nothing explicitly detaches the OLD PtySession from its OLD client on reconnect (relies on GC, which fails if the old client stays referenced/alive). - On-the-fly reconnect (tower-terminals.ts:860-958) is guarded by `!session`, so it fires once per lost-session window (NOT every poll — the subagent over-claimed "every poll"). Exact trigger for repeated accumulation is not confirmable by static read alone. -### Plan direction -Investigation-first + defensive hardening + accelerated repro test + bounded always-on instrumentation. Root cause is not confidently pinned statically (only manifests after ~10h), so the plan instruments to confirm on next occurrence AND lands safe hardening + a fast test that compresses the 10h leak into CI. +### Plan direction (initial draft — superseded) +First draft led with an EventEmitter listener-leak theory + instrumentation. Demoted after empirical work below. -Plan written to codev/plans/1047-tower-terminals-architects-bui.md — awaiting plan-approval gate. +### ROOT CAUSE CONFIRMED (empirical, 2026-06-15) +Architect pushed back: do we have a real hypothesis, and is restart actually the fix? Investigated the live state + on-disk PTY logs. + +- Tower currently DOWN; 12 shellpers persist (by design). +- Measured `~/.agent-farm/logs/*.log` (raw PTY output). Incident-window log `f02bedcb…` (14 Jun 21:01, 15 MB) has **0 newlines** — longest no-newline run = the entire 14.57 MB. Byte census: 1.5M ESC, 164K CR, 0 LF → a Claude full-screen TUI redraw stream (`\e[?1049h \e[2J \e[H … \e[3G \r`). Normal sessions: ~20 bytes/line, longest line ~5-7KB. +- Root cause: two buffers bound by NEWLINE COUNT, never bound for a no-newline TUI stream: + 1. `RingBuffer.pushData` (ring-buffer.ts:36-51): `partial + data` then `split('\n')` every frame → O(partial) per frame, O(n²)/session, partial unbounded. No byte cap (only the stderr ring at session-manager.ts:67 has maxLineLength=10000). + 2. `ShellperReplayBuffer.append` (shellper-replay-buffer.ts:45,58): evicts only when lineCount>maxLines → 0 newlines → never evicts → unbounded; fed by shellper-process.ts:143. +- Explains ALL symptoms incl. the two the listener theory couldn't: CPU-without-memory, and **restart-unreliability** (shellper replay is also unbounded → restart re-seeds a 15 MB partial → O(n²) resumes unless the heavy session is gone). That is the answer to "not sure restart resolves it." + +### Fix (plan rewritten) +Primary: (1) RingBuffer.pushData scan only `data` + byte-cap `partial` (front-trim); (2) ShellperReplayBuffer byte cap. Secondary/optional: listener hygiene (attachShellper idempotent, createSessionRaw teardown). Plus targeted instrumentation (log partial/replay sizes on the 30s heartbeat). Accelerated unit tests compress the leak into CI; replay-correctness test guards no-regression. + +Plan rewritten in codev/plans/1047-tower-terminals-architects-bui.md — still at plan-approval gate, awaiting review. Two open questions for reviewer: cap sizes, and whether to include listener-hygiene here or split it out. From c916bcc26fdd9e19577a5184d2d70e1daece9afa Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 08:55:41 +1000 Subject: [PATCH 05/31] [PIR #1047] Plan: add client replay-storm pathway + client-side fix (Fix C) --- .../1047-tower-terminals-architects-bui.md | 130 ++++++++++-------- codev/state/pir-1047_thread.md | 10 ++ 2 files changed, 80 insertions(+), 60 deletions(-) diff --git a/codev/plans/1047-tower-terminals-architects-bui.md b/codev/plans/1047-tower-terminals-architects-bui.md index 6155cdeb3..bb90624cd 100644 --- a/codev/plans/1047-tower-terminals-architects-bui.md +++ b/codev/plans/1047-tower-terminals-architects-bui.md @@ -1,107 +1,117 @@ -# PIR Plan: Tower terminal freeze — unbounded no-newline buffers (O(n²) output pump) +# PIR Plan: Tower terminal freeze — oversized replay storm from unbounded no-newline buffers -> Issue #1047. Tower terminals (architects + builders) become non-responsive over time; the only known recovery is a Tower restart, and even that is **not reliably** effective. +> Issue #1047. Tower terminals (architects + builders) become non-responsive over time; restart is the only known recovery, and even that is **not reliably** effective. ## Understanding -### Root cause (empirically confirmed) +### Root cause (confirmed end-to-end with captured data) -Tower's per-PTY-frame hot path re-scans an **unbounded, newline-delimited buffer** on every byte burst. A full-screen TUI (Claude Code's prompt/UI) runs in the alternate screen buffer and redraws **in place** using cursor-addressing and carriage returns, emitting almost **no `\n` bytes**. Two buffers in the output path bound themselves by newline count and therefore never bound at all for such a stream: +A full-screen TUI (Claude Code's UI) runs in the alternate screen buffer and redraws **in place** with cursor-addressing and carriage returns, emitting almost **no `\n`**. Several buffers in the terminal path bound themselves **by newline count**, so they never bound for such a stream. The unbounded buffer then produces a **replay payload larger than the client's 1 MB backpressure limit**, which drives the VSCode terminal client into a **tight, backoff-free disconnect/reconnect/replay loop**. That loop — re-serializing a multi-MB replay thousands of times on Tower's single event loop — is the dominant CPU sink and starves every other terminal. -1. **`RingBuffer.partial`** (`packages/codev/src/terminal/ring-buffer.ts:36-51`). `pushData` does `const combined = this.partial + data; combined.split('\n')` and saves the trailing fragment back into `this.partial`. With no `\n` in the stream, `partial` grows without limit, and **every frame re-concatenates and re-splits the entire accumulated buffer** — O(partial) per frame, O(n²) over the session. There is no byte cap on `partial` (confirmed: the only byte cap in the terminal layer is the *stderr* ring buffer's `maxLineLength = 10000` at `session-manager.ts:67` — the main RingBuffer never got one). -2. **`ShellperReplayBuffer`** (`packages/codev/src/terminal/shellper-replay-buffer.ts:45,58`) on the shellper side. It evicts only `while (this.lineCount > this.maxLines …)`. With zero newlines, `lineCount` stays 0, eviction never fires, and `chunks` grows unbounded. Every PTY chunk is appended here (`shellper-process.ts:143`). +The closed causal loop: -### Empirical evidence +1. **Unbounded accumulation.** `RingBuffer.partial` (`ring-buffer.ts:36-51`) does `combined = this.partial + data; combined.split('\n')` every frame and keeps the trailing fragment; with no `\n`, `partial` grows without bound and each frame re-scans the whole thing (O(partial)/frame → O(n²)/session). The shellper's `ShellperReplayBuffer` (`shellper-replay-buffer.ts:45,58`, fed at `shellper-process.ts:143`) evicts only `while lineCount > maxLines`, so with zero newlines it never evicts. Neither has a byte cap (the only byte cap in the layer is the *stderr* ring's `maxLineLength=10000` at `session-manager.ts:67`). +2. **Oversized replay.** On every (re)connect, Tower sends the buffer as one frame: `ws.send(encodeData(replayLines.join('\n')))` (`tower-websocket.ts:62-65`), where `replayLines = ringBuffer.getAll()` = `[partial]`. For a no-newline session that is the entire accumulated blob. +3. **Client backpressure trips on the replay itself.** The VSCode client's `handleData` (`packages/vscode/src/terminal-adapter.ts:300-308`) does `this.queuedBytes += payload.length; if (this.queuedBytes > MAX_QUEUE /*1 MB*/) { reconnect() }`. The replay frame alone exceeds 1 MB, so it disconnects **before rendering anything**. +4. **Reconnect with no backoff.** `reconnect()` (`terminal-adapter.ts:281-296`) calls `this.backoff.reset()` and reconnects immediately → Tower re-sends the same oversized replay → back to step 3. Infinite loop. -Measuring the on-disk PTY logs (`~/.agent-farm/logs/*.log`, raw terminal output) across real sessions: +### Captured evidence -| Session log | Size | Newlines | Longest run without `\n` | -|---|---|---|---| -| `f02bedcb…` (14 Jun 21:01, **incident window**) | 15 MB | **0** | **14.57 MB (entire file)** | -| `d8406afb…` | 24 MB | 1,215,838 | 5,295 bytes | -| `4d5523cb…` | 20 MB | 1,098,370 | 7,596 bytes | +PTY disk logs (`~/.agent-farm/logs/*.log`, exact terminal output): -The incident-window session emitted **15 MB with not a single newline**. Byte census of that file: 1,500,892 ESC (`0x1b`), 164,652 CR (`0x0d`), 0 LF — a control-heavy redraw stream (`\e[?1049h \e[2J \e[H \e[?25l … \e[3G \e[5G … \r`). For that session, Tower's `partial` grows to ~15 MB and every incoming frame re-splits ~15 MB. Other sessions (normal newline density, ~20 bytes/line) are unaffected — which is exactly why the bug is intermittent and session-dependent. +| Session | Size | Newlines | Longest run without `\n` | +|---|---|---|---| +| `f2dc55d1…` (**the storming terminal**) | 1.9 MB | **0** | **1.89 MB (entire file)** | +| `f02bedcb…` (incident window) | 15 MB | **0** | 14.57 MB | +| `d8406afb…` / `4d5523cb…` (normal) | 20–24 MB | ~1.1 M | 5–7 KB | -### Why this matches every observation +`f2dc55d1`'s replay = 1.9 MB **> 1 MB** `MAX_QUEUE` → trips backpressure every reconnect. Byte census of the 15 MB `f02bedcb` log: 1,500,892 ESC, 164,652 CR, **0 LF** — a Claude alt-screen redraw stream (`\e[?1049h \e[2J \e[H … \e[3G \r`). -- **CPU climbs ~linearly with uptime, then ~93% (one core saturated).** `partial` length grows ∝ uptime for an actively-redrawing TUI; per-frame cost ∝ `partial`; frame rate roughly constant → CPU ∝ uptime. Node is single-threaded, so one saturating session pegs ~one core. (Note: the "linear" framing rests on two CPU samples; this mechanism is consistent with a linear climb, and also with a faster ramp once a heavy session starts — both fit.) -- **ALL terminals freeze at once.** The O(n²) re-scan runs on the single shared event loop. One heavy session starves every other terminal's I/O behind it. -- **Memory grows only modestly (+76 MB / 10h).** The cost is CPU (repeated scan + GC churn from re-allocating multi-MB strings each frame), not retained memory. The live `partial`s plus GC headroom account for the modest RSS rise. -- **Input still propagates while render-back is broken** (the issue's stray-`e` screenshot). Writing input is O(small) and occasionally slips through; the output pump's O(partial) re-scan dominates and stalls rendering. -- **Restart is NOT reliably effective.** This is the key point in your question. `ShellperReplayBuffer` is *also* unbounded for no-newline streams, so on restart the shellper replays the full ~15 MB, Tower seeds a fresh `partial` from it, and the O(n²) re-scan resumes almost immediately if the heavy session is still redrawing. Restart only "fixes" it when the offending session has gone quiet or ended. That session-dependence is why you're not sure the restart resolves it. +VSCode extension log (`tmp/vscode-log.txt`, 42,143 lines, 21:53→22:51, ~58 min): -### Pathway walk-through (where responsiveness can break) +- **14,026 "WebSocket connected"**, **14,015 "Backpressure exceeded 1 MB — disconnecting for replay"**, **14,026 "Connecting to"** — a 1:1:1 storm. +- **14,017 of the reconnects target the single terminal `f2dc55d1`**; every other terminal appears 1–2× (normal). One terminal drove ~14,000 oversized replays in under an hour. -Output (render-back) path, which is the broken direction: +Note the threshold is low: it takes only **~1 MB** of no-newline output to cross `MAX_QUEUE` and start the storm, so this is easier to hit than the 15 MB extreme suggests. -1. **Shellper:** `pty.onData` → `ShellperReplayBuffer.append(buf)` **[BUG #2: unbounded on no-newline]** → forwards the chunk over the unix socket to Tower (`shellper-process.ts:143-146`). -2. **Tower:** `ShellperClient` parser emits `'data'` → `PtySession.onPtyData` (`pty-session.ts:251-281`): `RingBuffer.pushData` **[BUG #1: O(n²) re-scan + unbounded]**, then a synchronous `fs.writeSync` disk-log per frame **[secondary event-loop blocker]**, then broadcast to WS clients (drops frames when `ws.bufferedAmount ≥ 1 MB` — a symptom under starvation, not a cause), then `emit('data')`. -3. **Client:** WS → VSCode extension terminal client → xterm.js render. +### Why this matches every observation -Input path (stays responsive): WS message → `session.write` → `shellperClient.write` → unix socket → shellper → PTY. Small, cheap; consistent with the observed "input gets through, output frozen." +- **CPU ~93% (one core).** Tower re-builds (`getAll()`) and re-serializes (`encodeData`) a multi-MB replay ~14,000 times on its single event loop; the per-frame O(n²) `split` adds to it. One storming session pegs ~one core. +- **ALL terminals freeze at once.** The storm runs on the shared event loop; every other terminal's I/O starves behind it. +- **Memory rises only modestly.** The cost is CPU + GC churn (repeated multi-MB string allocation), not retained memory. +- **Input still works while render-back is frozen** (the stray-`e` screenshot): writing input is tiny and slips through; the replay storm dominates output. +- **Restart is NOT reliable.** `ShellperReplayBuffer` is also unbounded for no-newline streams, so after restart the shellper replays the multi-MB blob, Tower's fresh `partial` is re-seeded from it, the replay is oversized again, and the client re-enters the storm — unless the offending session has gone quiet/ended. That session-dependence is exactly the "not sure restart fixes it" symptom. -The dominant responsiveness failure is Bug #1 (the O(n²) output re-scan); Bug #2 makes restart unreliable and leaks shellper memory; the per-frame `writeSync` is a smaller additive event-loop cost. +### Pathway summary -### Status of the earlier listener-leak hypothesis +Output (broken render-back direction): shellper `pty.onData` → `ShellperReplayBuffer.append` **[unbounded]** → socket → Tower `onPtyData` → `RingBuffer.pushData` **[O(n²) + unbounded]** + per-frame `fs.writeSync` **[secondary blocker]** → on (re)connect `getAll()` → `ws.send` a multi-MB replay → client `handleData` **[backpressure trips on replay]** → `reconnect()` **[no backoff]** → loop. Input path (WS → `session.write` → shellper) stays cheap and responsive, matching the symptom. -My first draft led with an EventEmitter listener-accumulation theory on the PIR #991 reconnect surface. After empirical measurement, that is **demoted to a secondary, defensive cleanup**: it does not explain the 15 MB zero-newline artifact, the CPU-without-memory profile, or the restart-unreliability, whereas the buffer mechanism explains all three. There remain real-but-minor listener-hygiene gaps (`createSessionRaw` overwrites `sessions` without teardown at `pty-manager.ts:161`; `attachShellper` re-subscribes without removing prior listeners) which are worth a low-risk guard but are not the headline. +The earlier EventEmitter listener-leak hypothesis is **demoted to an optional defensive cleanup** — it explains none of the captured evidence (the oversized replay, the 14k-cycle storm, the restart-unreliability), whereas this chain explains all of it. ## Proposed Change -### Fix 1 (primary) — stop re-scanning, and byte-cap `RingBuffer.partial` +Three coordinated fixes. A + B remove the oversized replay at the source (which alone breaks the storm); C makes the client structurally immune to any future oversized replay (defense-in-depth, and it stops the infinite loop directly). + +### Fix A (primary, Tower) — `RingBuffer.pushData`: scan only new data + byte-cap `partial` + +- Scan only the incoming `data` for newlines instead of re-splitting `partial + data`, making per-frame work O(|data|) and killing the O(n²) re-scan. Behavior-preserving for replay (same lines, same partial). +- Cap `partial` to a byte limit (`MAX_PARTIAL_BYTES`, comfortably **under** the client's 1 MB `MAX_QUEUE` — e.g. 256 KB), front-trimming when an unbroken run exceeds it. This guarantees `getAll()`/replay stays small, so the client never trips backpressure on a replay. Front-trim (not synthetic `\n` injection) avoids corrupting a TUI replay; reconnect drives a full repaint that self-heals the trimmed prefix. -In `RingBuffer.pushData` (`ring-buffer.ts`): +### Fix B (primary, shellper) — `ShellperReplayBuffer`: byte cap -- **Scan only the incoming `data` for newlines**, not the whole `partial + data`. Walk newline indices in `data`, push completed lines (the current `partial` prefixes only the first), and keep the trailing remainder as the new `partial`. This makes per-frame work O(|data|) instead of O(|partial|) and is **behavior-preserving** for replay (same lines, same partial). -- **Cap `partial` to a max byte length** (e.g. a generous `MAX_PARTIAL_BYTES`, on the order of 256 KB–1 MB — large enough that real lines are never clipped, small enough to bound work/memory). When an unbroken run exceeds the cap, trim the front of `partial` to the last `MAX_PARTIAL_BYTES`. Front-trim (rather than injecting a synthetic `\n`) avoids corrupting a TUI replay with spurious line feeds; the slight loss of the alt-screen-enter prefix self-heals because reconnect triggers a full TUI repaint (resize). The exact cap and trim-vs-segment choice is finalized in implementation and validated at the `dev-approval` gate. +Add a `maxBytes` cap alongside `maxLines` (it already tracks `totalBytes`): evict oldest chunks while `totalBytes > maxBytes`, with a single-chunk front-trim edge case mirroring the line logic. Bounds shellper memory for no-newline streams and bounds the REPLAY frame so a restart can't re-seed an oversized buffer — making restart-as-recovery deterministic. -### Fix 2 (primary) — byte-cap `ShellperReplayBuffer` +### Fix C (primary, VSCode client) — break the backpressure infinite loop -In `ShellperReplayBuffer.append` (`shellper-replay-buffer.ts`): add a `maxBytes` cap alongside `maxLines` (it already tracks `totalBytes`). Evict oldest chunks while `totalBytes > maxBytes`, with a single-chunk front-trim edge case mirroring the existing line logic. This bounds shellper memory for no-newline streams and, crucially, bounds the REPLAY frame so a restart can no longer re-seed a multi-MB `partial` — making restart-as-recovery deterministic and keeping it cheap. +`terminal-adapter.ts` must not enter an unbounded, backoff-free reconnect loop when the replay itself exceeds `MAX_QUEUE`. Direction (final mechanism settled in implementation, validated at `dev-approval`): -### Fix 3 (defensive, optional) — listener hygiene +- Do **not** `backoff.reset()` on a backpressure-triggered reconnect; apply normal backoff so repeated backpressure can't busy-loop, and/or add a guard so an immediate re-trip after reconnect does not instantly reconnect again. +- Prefer routing the initial replay burst through the already-paced `writeChunked` path (which yields via `setImmediate`) **outside** the live-backpressure budget, rather than counting it as instantaneous overload. The half-present `pause`/`resume` control protocol (`terminal-adapter.ts:345-355`; server would bracket the replay) is the clean way to mark "this is replay, don't count it" — completing that bracket is the preferred long-term shape. -Make `attachShellper` idempotent (remove prior `data`/`exit`/`close` listeners before re-subscribing) and have the on-the-fly reconnect / `createSessionRaw` path tear down any pre-existing PtySession under a reused id before replacing it. Correct in its own right; low risk. Include only if it does not bloat the change — otherwise spin out to a follow-up issue. +Net: the live-streaming backpressure guard keeps protecting against genuine runaway output, but the **replay** path can no longer cause an infinite disconnect storm. ### Instrumentation (targeted, cheap) -On the existing 30s SSE heartbeat, log per-session `ringBuffer` partial length and shellper replay byte size, and `WARN` when a partial exceeds a threshold (naming the terminal id). This directly observes the now-known cause and confirms the fix holds in production. Far more targeted than the listener-count instrumentation from the first draft. +On the existing 30 s SSE heartbeat, log per-session `ringBuffer` partial length and shellper replay byte size, and `WARN` (with terminal id) when a partial exceeds a threshold. Directly observes this cause and confirms the fix holds in production. ### Explicitly out of scope -- Converting the per-frame `fs.writeSync` disk log to async/batched (a smaller, separate optimization — note for a follow-up if profiling after Fix 1 still shows it). -- Any change to default `tower stop` shellper-survival behavior (#274 / #832 / #999 / #991). -- The cron `ReferenceError` (#1048) and the spawn-failure sibling (#1038). +- Per-frame `fs.writeSync` → async/batched disk log (smaller separate optimization; note for follow-up if profiling after A still shows it). +- The listener-hygiene cleanup (`attachShellper` idempotency, `createSessionRaw` teardown at `pty-manager.ts:161`) — optional; include only if it doesn't bloat the change, else spin to a follow-up issue. +- Default `tower stop` shellper-survival (#274/#832/#999/#991); cron `ReferenceError` (#1048); spawn-failure sibling (#1038). ## Files to Change -- `packages/codev/src/terminal/ring-buffer.ts` — `pushData`: scan only `data`; byte-cap `partial`. Add `MAX_PARTIAL_BYTES`. -- `packages/codev/src/terminal/shellper-replay-buffer.ts` — add `maxBytes` cap + byte-based eviction; thread a sane default (and wire `replayBufferLines`/a new bytes option from `shellper-process.ts:97` if needed). -- `packages/codev/src/agent-farm/servers/tower-server.ts:236-262` — bounded per-tick partial/replay-size instrumentation on the SSE heartbeat. -- (optional, Fix 3) `packages/codev/src/terminal/pty-session.ts` (`attachShellper`) and `packages/codev/src/terminal/pty-manager.ts:161` (`createSessionRaw` teardown). -- Tests under `packages/codev/src/terminal/__tests__/` — see Test Plan. +- `packages/codev/src/terminal/ring-buffer.ts` — `pushData`: scan-only-data + byte-cap `partial`; add `MAX_PARTIAL_BYTES`. +- `packages/codev/src/terminal/shellper-replay-buffer.ts` — add `maxBytes` cap + byte eviction; thread a default (and a bytes option from `shellper-process.ts:97` if needed). +- `packages/vscode/src/terminal-adapter.ts` — backpressure path: backoff/guard + replay-burst handling so it can't infinite-loop (`handleData` ~300-308, `reconnect` ~281-296; possibly complete the `pause`/`resume` replay bracket). +- `packages/codev/src/agent-farm/servers/tower-websocket.ts` — if the `pause`/`resume` replay bracket is adopted, wrap the replay send (`62-75`). +- `packages/codev/src/agent-farm/servers/tower-server.ts:236-262` — partial/replay-size instrumentation on the SSE heartbeat. +- Tests: `packages/codev/src/terminal/__tests__/` and `packages/vscode/` (see Test Plan). +- VSCode CHANGELOG + `docs/releases/UNRELEASED.md` — user-facing entry for the terminal-freeze fix (per repo convention for vscode-affecting changes). ## Risks & Alternatives Considered -- **Risk: byte-trimming `partial` corrupts replay for a no-newline TUI** (loses early escape state). Mitigation: generous cap so only pathological streams are trimmed; reconnect already drives a full repaint; the accelerated test asserts a *normal* newline stream replays byte-identically (no behavior change for the common case). The reviewer validates a live reconnect at the `dev-approval` gate. -- **Risk: the cap is too aggressive and clips legitimately long single lines** (e.g. a 200 KB JSON blob on one line). Mitigation: set the cap well above realistic single-line sizes; it only bounds *pathological* unbroken runs (megabytes). -- **Risk: there is a second, independent leak** (e.g. the listener gap) that this doesn't cover. Mitigation: the instrumentation ships regardless and will surface any residual growth; Fix 3 covers the known listener gap defensively. -- **Alternative — only add instrumentation and bisect v3.1.5..v3.1.7 first.** Rejected as primary: the disk-log evidence already pins the mechanism without a 12h+ soak, and bisect would land no fix. Bisect remains a fallback if post-fix soak still shows CPU growth. -- **Alternative — segment the over-cap partial as ring lines instead of front-trimming.** Rejected (default) because rejoining ring lines with `\n` injects spurious line feeds into a TUI replay; front-trim avoids that. Revisit if a continuation-aware ring model is wanted later. +- **Risk: byte-trimming `partial` corrupts a no-newline TUI replay** (loses early escape state). Mitigation: reconnect drives a full TUI repaint; the accelerated test asserts a *normal* newline stream replays byte-identically (no change for the common case); the reviewer validates a live reconnect at `dev-approval`. +- **Risk: the cap clips a legitimately long single line** (e.g. a 200 KB single-line JSON). Mitigation: `MAX_PARTIAL_BYTES` set well above realistic single-line sizes but under `MAX_QUEUE`; it only bounds pathological multi-MB unbroken runs. +- **Risk: Fix C alone (without A/B) only hides the storm.** Mitigation: A/B remove the oversized replay at the source; C is defense-in-depth. We ship all three so neither a Tower-side nor a client-side regression can resurrect the storm. +- **Alternative — only instrument + bisect v3.1.5..v3.1.7.** Rejected as primary: captured data already pins the mechanism without a 12 h+ soak and bisect lands no fix. Bisect stays a fallback if a post-fix soak still shows growth. +- **Alternative — raise the client `MAX_QUEUE`.** Rejected: it only moves the threshold; a larger no-newline stream still crosses it, and it does nothing for the Tower-side O(n²)/memory. ## Test Plan -Run from the worktree (`pnpm --filter @cluesmith/codev …`), not the main checkout. +Run from the worktree (`pnpm --filter @cluesmith/codev …` for core; build the VSCode package for C). -- **Unit — CPU bound (the core fix):** feed a synthetic no-newline stream of M bytes across K frames into `RingBuffer.pushData`; assert per-frame work does not scale with accumulated size (e.g. assert `partial.length` stays ≤ cap, and a frame-cost proxy stays flat as the stream grows). Today this test would show partial growing to M and unbounded re-scan; after the fix it stays bounded. -- **Unit — replay correctness (no regression):** a normal newline-delimited stream produces byte-identical `getAll()` / `getSince()` output before and after the change. -- **Unit — shellper buffer:** feed a zero-newline stream exceeding `maxBytes`; assert `ShellperReplayBuffer.size` stays ≤ `maxBytes` and `getReplayData()` returns the bounded tail. -- **Build + full suite:** `pnpm --filter @cluesmith/codev build` and `… test` green from the worktree. -- **Manual / live (at `dev-approval`):** start Tower on this branch, open an architect terminal running Claude's full-screen UI, let it redraw for a while, and watch `afx tower log -f`: the new instrumentation should show the partial size **plateau at the cap** rather than climb, and Tower CPU should stay low. Reconnect the terminal (close/reopen the VSCode tab) and confirm the screen repaints correctly (replay still works). Optionally stop/start Tower against a still-busy session and confirm CPU does **not** immediately re-saturate (Fix 2 working). -- **Soak (post-merge / verify, non-gating):** leave Tower running several hours on a real workload; confirm CPU stays flat. This is the true end-to-end confirmation but cannot fit a gate session. +- **Unit — CPU bound (core):** feed a synthetic no-newline stream of M bytes / K frames into `RingBuffer.pushData`; assert `partial.length` stays ≤ cap and per-frame cost stays flat as the stream grows. (Today: partial → M, unbounded re-scan.) +- **Unit — replay correctness (no regression):** a normal newline stream yields byte-identical `getAll()`/`getSince()` before and after. +- **Unit — shellper buffer:** zero-newline stream exceeding `maxBytes` keeps `ShellperReplayBuffer.size ≤ maxBytes`; `getReplayData()` returns the bounded tail. +- **Unit/component — client loop:** simulate a replay frame > `MAX_QUEUE` into `terminal-adapter`'s data path; assert it does **not** enter an unbounded instant-reconnect loop (bounded attempts / backoff applied / replay paced), and that the terminal still renders. +- **Build + suites:** core `pnpm --filter @cluesmith/codev build && … test`; build the VSCode extension. +- **Manual / live (at `dev-approval`):** start Tower on this branch; open an architect terminal running Claude's full-screen UI; let it redraw past ~1 MB of no-newline output. Confirm via `afx tower log -f` that the partial size **plateaus at the cap** and Tower CPU stays low, and via the VSCode terminal log that there is **no** connect→backpressure→reconnect storm (contrast with the 14k-cycle capture). Reconnect the tab and confirm the screen repaints correctly. Optionally restart Tower against a still-busy session and confirm CPU does **not** re-saturate. +- **Soak (post-merge / verify, non-gating):** several hours on a real workload; CPU stays flat, no storms in the VSCode log. ## Open Questions for the Reviewer -1. **Cap sizes:** proposed `MAX_PARTIAL_BYTES` ~256 KB–1 MB and `ShellperReplayBuffer maxBytes` ~ a few MB. Comfortable with these, or want them configurable via env? -2. **Fix 3 (listener hygiene):** include in this PR as a small defensive guard, or split to a follow-up issue to keep this change tightly scoped to the buffer fix? +1. **Cap sizes:** `MAX_PARTIAL_BYTES` ~256 KB (must stay < the client's 1 MB `MAX_QUEUE`) and `ShellperReplayBuffer maxBytes` ~ a few MB. OK, or make them env-configurable? +2. **Fix C shape:** minimal (backoff + re-trip guard) now, or the fuller `pause`/`resume` replay-bracket so replay never counts toward backpressure? The latter is cleaner but touches both client and Tower. +3. **Listener hygiene:** include the optional defensive cleanup here, or split to a follow-up to keep this PR scoped to the storm fix? diff --git a/codev/state/pir-1047_thread.md b/codev/state/pir-1047_thread.md index c076a4cd6..f3d7e2440 100644 --- a/codev/state/pir-1047_thread.md +++ b/codev/state/pir-1047_thread.md @@ -30,3 +30,13 @@ Architect pushed back: do we have a real hypothesis, and is restart actually the Primary: (1) RingBuffer.pushData scan only `data` + byte-cap `partial` (front-trim); (2) ShellperReplayBuffer byte cap. Secondary/optional: listener hygiene (attachShellper idempotent, createSessionRaw teardown). Plus targeted instrumentation (log partial/replay sizes on the 30s heartbeat). Accelerated unit tests compress the leak into CI; replay-correctness test guards no-regression. Plan rewritten in codev/plans/1047-tower-terminals-architects-bui.md — still at plan-approval gate, awaiting review. Two open questions for reviewer: cap sizes, and whether to include listener-hygiene here or split it out. + +### CLINCHER: client-side replay storm (2026-06-15, from VSCode log) +Architect supplied tmp/vscode-log.txt (42,143 lines, 58 min) + screenshot. Storm confirmed: +- 14,026 "WebSocket connected" / 14,015 "Backpressure exceeded 1MB" / 14,026 "Connecting to" (1:1:1). +- 14,017 reconnects target ONE terminal: f2dc55d1. Its disk log = 1.9 MB, 0 newlines → replay 1.9 MB > client MAX_QUEUE (1 MB). +- Client handleData (terminal-adapter.ts:300-308): queuedBytes += payload; if >1MB → reconnect(). reconnect() (281-296) does backoff.reset() → instant retry, no backoff. Tower re-sends the 1.9 MB replay (tower-websocket.ts:62-65 getAll() = [partial]) → infinite loop. +- This replay storm (Tower re-serializing multi-MB replay ~14k×/hr on the single event loop) is the DOMINANT CPU driver; O(n²) split is secondary. Only ~1 MB of no-newline output is needed to trigger. + +### Plan now has 3 coordinated fixes +A: RingBuffer.pushData scan-only-data + byte-cap partial (< MAX_QUEUE). B: ShellperReplayBuffer byte cap. C: client backpressure path must not infinite-loop (backoff + re-trip guard; ideally pause/resume bracket so replay isn't counted as live backpressure). Plus instrumentation. Still at plan-approval gate. From 25ebf8ad50ac0eeb3bf1ce47e3b8ad8a279c379d Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 09:09:43 +1000 Subject: [PATCH 06/31] [PIR #1047] Plan: full remote-resilient fix set (resume-delta, drop-not-reconnect, listener hygiene) --- .../1047-tower-terminals-architects-bui.md | 128 ++++++++++-------- codev/state/pir-1047_thread.md | 11 ++ 2 files changed, 83 insertions(+), 56 deletions(-) diff --git a/codev/plans/1047-tower-terminals-architects-bui.md b/codev/plans/1047-tower-terminals-architects-bui.md index bb90624cd..4cd57a569 100644 --- a/codev/plans/1047-tower-terminals-architects-bui.md +++ b/codev/plans/1047-tower-terminals-architects-bui.md @@ -11,9 +11,9 @@ A full-screen TUI (Claude Code's UI) runs in the alternate screen buffer and red The closed causal loop: 1. **Unbounded accumulation.** `RingBuffer.partial` (`ring-buffer.ts:36-51`) does `combined = this.partial + data; combined.split('\n')` every frame and keeps the trailing fragment; with no `\n`, `partial` grows without bound and each frame re-scans the whole thing (O(partial)/frame → O(n²)/session). The shellper's `ShellperReplayBuffer` (`shellper-replay-buffer.ts:45,58`, fed at `shellper-process.ts:143`) evicts only `while lineCount > maxLines`, so with zero newlines it never evicts. Neither has a byte cap (the only byte cap in the layer is the *stderr* ring's `maxLineLength=10000` at `session-manager.ts:67`). -2. **Oversized replay.** On every (re)connect, Tower sends the buffer as one frame: `ws.send(encodeData(replayLines.join('\n')))` (`tower-websocket.ts:62-65`), where `replayLines = ringBuffer.getAll()` = `[partial]`. For a no-newline session that is the entire accumulated blob. +2. **Oversized replay.** On every (re)connect, Tower sends the buffer as one frame: `ws.send(encodeData(replayLines.join('\n')))` (`tower-websocket.ts:62-65`), where `replayLines = ringBuffer.getAll()` = `[partial]`. 3. **Client backpressure trips on the replay itself.** The VSCode client's `handleData` (`packages/vscode/src/terminal-adapter.ts:300-308`) does `this.queuedBytes += payload.length; if (this.queuedBytes > MAX_QUEUE /*1 MB*/) { reconnect() }`. The replay frame alone exceeds 1 MB, so it disconnects **before rendering anything**. -4. **Reconnect with no backoff.** `reconnect()` (`terminal-adapter.ts:281-296`) calls `this.backoff.reset()` and reconnects immediately → Tower re-sends the same oversized replay → back to step 3. Infinite loop. +4. **Reconnect with no backoff.** `reconnect()` (`terminal-adapter.ts:281-296`) calls `this.backoff.reset()` and reconnects immediately, opening the bare `/ws/terminal/` (no `resume`) → Tower re-sends the same full oversized replay → back to step 3. Infinite loop. ### Captured evidence @@ -25,93 +25,109 @@ PTY disk logs (`~/.agent-farm/logs/*.log`, exact terminal output): | `f02bedcb…` (incident window) | 15 MB | **0** | 14.57 MB | | `d8406afb…` / `4d5523cb…` (normal) | 20–24 MB | ~1.1 M | 5–7 KB | -`f2dc55d1`'s replay = 1.9 MB **> 1 MB** `MAX_QUEUE` → trips backpressure every reconnect. Byte census of the 15 MB `f02bedcb` log: 1,500,892 ESC, 164,652 CR, **0 LF** — a Claude alt-screen redraw stream (`\e[?1049h \e[2J \e[H … \e[3G \r`). +`f2dc55d1`'s replay = 1.9 MB **> 1 MB** `MAX_QUEUE`. Byte census of the 15 MB log: 1,500,892 ESC, 164,652 CR, **0 LF** — a Claude alt-screen redraw stream (`\e[?1049h \e[2J \e[H … \e[3G \r`). -VSCode extension log (`tmp/vscode-log.txt`, 42,143 lines, 21:53→22:51, ~58 min): - -- **14,026 "WebSocket connected"**, **14,015 "Backpressure exceeded 1 MB — disconnecting for replay"**, **14,026 "Connecting to"** — a 1:1:1 storm. -- **14,017 of the reconnects target the single terminal `f2dc55d1`**; every other terminal appears 1–2× (normal). One terminal drove ~14,000 oversized replays in under an hour. - -Note the threshold is low: it takes only **~1 MB** of no-newline output to cross `MAX_QUEUE` and start the storm, so this is easier to hit than the 15 MB extreme suggests. +VSCode extension log (`tmp/vscode-log.txt`, 42,143 lines, ~58 min): **14,026** "WebSocket connected" / **14,015** "Backpressure exceeded 1 MB" / **14,026** "Connecting to" (1:1:1), **14,017** of them targeting the single terminal `f2dc55d1`. Only **~1 MB** of no-newline output is needed to start the storm. ### Why this matches every observation -- **CPU ~93% (one core).** Tower re-builds (`getAll()`) and re-serializes (`encodeData`) a multi-MB replay ~14,000 times on its single event loop; the per-frame O(n²) `split` adds to it. One storming session pegs ~one core. -- **ALL terminals freeze at once.** The storm runs on the shared event loop; every other terminal's I/O starves behind it. -- **Memory rises only modestly.** The cost is CPU + GC churn (repeated multi-MB string allocation), not retained memory. -- **Input still works while render-back is frozen** (the stray-`e` screenshot): writing input is tiny and slips through; the replay storm dominates output. -- **Restart is NOT reliable.** `ShellperReplayBuffer` is also unbounded for no-newline streams, so after restart the shellper replays the multi-MB blob, Tower's fresh `partial` is re-seeded from it, the replay is oversized again, and the client re-enters the storm — unless the offending session has gone quiet/ended. That session-dependence is exactly the "not sure restart fixes it" symptom. +- **CPU ~93% (one core), all terminals freeze at once:** Tower re-builds + re-serializes a multi-MB replay ~14,000 times on its single event loop, starving every other terminal. Input stays responsive (tiny); output is frozen — matching the stray-`e` screenshot. +- **Memory barely moves:** the cost is CPU + GC churn, not retained memory. +- **Restart is NOT reliable:** `ShellperReplayBuffer` is also unbounded, so after restart the shellper replays the multi-MB blob, Tower re-seeds an oversized `partial`, and the storm resumes unless the offending session has gone quiet/ended. -### Pathway summary +The earlier EventEmitter listener-leak hypothesis is **demoted to a defensive cleanup** (folded in below) — it explains none of the captured evidence, whereas this chain explains all of it. -Output (broken render-back direction): shellper `pty.onData` → `ShellperReplayBuffer.append` **[unbounded]** → socket → Tower `onPtyData` → `RingBuffer.pushData` **[O(n²) + unbounded]** + per-frame `fs.writeSync` **[secondary blocker]** → on (re)connect `getAll()` → `ws.send` a multi-MB replay → client `handleData` **[backpressure trips on replay]** → `reconnect()` **[no backoff]** → loop. Input path (WS → `session.write` → shellper) stays cheap and responsive, matching the symptom. +### Local vs remote hosting raises the stakes -The earlier EventEmitter listener-leak hypothesis is **demoted to an optional defensive cleanup** — it explains none of the captured evidence (the oversized replay, the 14k-cycle storm, the restart-unreliability), whereas this chain explains all of it. +Tower can run on loopback (same machine as VSCode) or remotely (cloud / via the tunnel). Three things flip when remote, and they shape the fix: + +- **Reconnects are frequent remotely** (WAN blips, tunnel timeouts, sleep/wake) — so a *full replay on every reconnect* is paid routinely, not just during the storm. Reconnect-path leaks also accumulate proportionally faster. +- **Bytes cost money and bandwidth remotely** — 14k × 1.9 MB ≈ 26 GB/hr for one terminal; even a *bounded* buffer re-shipped on every reconnect is wasteful. +- **"Reconnect to relieve backpressure" is harmful remotely** — it re-downloads the whole buffer over the congested link. Notably Tower's *live-output* side already does the resilient thing (`if (ws.bufferedAmount < WS_HIGH_WATER_MARK) ws.send(...)` — it **drops** ephemeral frames, `tower-websocket.ts:47`); the client's remedy is the opposite and must be brought into line. ## Proposed Change -Three coordinated fixes. A + B remove the oversized replay at the source (which alone breaks the storm); C makes the client structurally immune to any future oversized replay (defense-in-depth, and it stops the infinite loop directly). +One coordinated change built on four resilience principles (correct regardless of hosting; load-bearing remotely): bound by bytes, make reconnect a cheap delta, separate replay from live overload, and keep a hard last-resort safety net. + +### Fix A (Tower) — `RingBuffer`: scan-only, byte-cap, and byte-addressable resume -### Fix A (primary, Tower) — `RingBuffer.pushData`: scan only new data + byte-cap `partial` +- **Scan only the incoming `data`** for newlines instead of re-splitting `partial + data` → per-frame work O(|data|), killing the O(n²) re-scan. Behavior-preserving for replay. +- **Byte-cap `partial`** (`MAX_PARTIAL_BYTES`), front-trimming an over-cap unbroken run. Bounds CPU/memory/bandwidth. Front-trim (not synthetic `\n`) avoids corrupting a TUI replay; reconnect drives a full repaint that self-heals. +- **Byte-addressable sequence:** advance a monotonic byte counter on every `pushData` (not only on completed lines), and make `getSince(seq)` return the bytes after that offset (reconstructed across the retained lines + partial). Today `seq` only moves on a completed line, so for a no-newline stream `currentSeq` stays 0 and resume returns the whole partial — defeating resume exactly when it's needed most. This is what makes Fix C's delta-reconnect actually work for TUI streams. If a client's position is older than the retained window, fall back to a bounded full replay (≤ cap). -- Scan only the incoming `data` for newlines instead of re-splitting `partial + data`, making per-frame work O(|data|) and killing the O(n²) re-scan. Behavior-preserving for replay (same lines, same partial). -- Cap `partial` to a byte limit (`MAX_PARTIAL_BYTES`, comfortably **under** the client's 1 MB `MAX_QUEUE` — e.g. 256 KB), front-trimming when an unbroken run exceeds it. This guarantees `getAll()`/replay stays small, so the client never trips backpressure on a replay. Front-trim (not synthetic `\n` injection) avoids corrupting a TUI replay; reconnect drives a full repaint that self-heals the trimmed prefix. +### Fix B (shellper) — `ShellperReplayBuffer`: byte cap -### Fix B (primary, shellper) — `ShellperReplayBuffer`: byte cap +Add a `maxBytes` cap alongside `maxLines` (it already tracks `totalBytes`): evict oldest chunks while `totalBytes > maxBytes`, with a single-chunk front-trim edge case mirroring the line logic. Bounds shellper memory and the REPLAY frame so restart can't re-seed an oversized buffer. -Add a `maxBytes` cap alongside `maxLines` (it already tracks `totalBytes`): evict oldest chunks while `totalBytes > maxBytes`, with a single-chunk front-trim edge case mirroring the line logic. Bounds shellper memory for no-newline streams and bounds the REPLAY frame so a restart can't re-seed an oversized buffer — making restart-as-recovery deterministic. +### Fix C (VSCode client) — reconnect/backpressure redesign (`terminal-adapter.ts`) -### Fix C (primary, VSCode client) — break the backpressure infinite loop +- **Delta reconnect:** reconnect with `?resume=` (the client already tracks `lastSeq` from `seq` control frames — it's currently dead code) so a reconnect ships only new bytes, not the whole buffer. The biggest remote win. +- **Replay excluded from backpressure:** deliver the replay burst through the already-paced `writeChunked` path (yields via `setImmediate`) and do **not** count it toward `queuedBytes`. Use the half-present `pause`/`resume` control bracket (`terminal-adapter.ts:345-355`) to mark "this is replay" so the client renders any replay size without tripping. This also removes the hard cross-package `MAX_PARTIAL_BYTES < MAX_QUEUE` coupling — once replay never counts as backpressure, the caps are about Tower cost, not client safety. +- **Live overload drops, not reconnects:** when *live* (post-replay) output exceeds the budget, drop/coalesce frames (terminal output is ephemeral; the next repaint heals it), mirroring Tower's `bufferedAmount` drop. Reconnect becomes a last resort, never the routine remedy. +- **Hard safety net kept:** do not `backoff.reset()` on a backpressure-driven reconnect; retain the backoff curve + give-up so no Tower bug, tunnel flap, or future regression can busy-loop. -`terminal-adapter.ts` must not enter an unbounded, backoff-free reconnect loop when the replay itself exceeds `MAX_QUEUE`. Direction (final mechanism settled in implementation, validated at `dev-approval`): +### Fix D (Tower) — replay bracket + resume wiring (`tower-websocket.ts`) -- Do **not** `backoff.reset()` on a backpressure-triggered reconnect; apply normal backoff so repeated backpressure can't busy-loop, and/or add a guard so an immediate re-trip after reconnect does not instantly reconnect again. -- Prefer routing the initial replay burst through the already-paced `writeChunked` path (which yields via `setImmediate`) **outside** the live-backpressure budget, rather than counting it as instantaneous overload. The half-present `pause`/`resume` control protocol (`terminal-adapter.ts:345-355`; server would bracket the replay) is the clean way to mark "this is replay, don't count it" — completing that bracket is the preferred long-term shape. +Bracket the replay send with `pause` → replay data → `resume` control frames so the client can exclude it from backpressure; honor `?resume=` via `attachResume` (already present) using Fix A's byte-aware `getSince`; emit the byte-aware `seq`. -Net: the live-streaming backpressure guard keeps protecting against genuine runaway output, but the **replay** path can no longer cause an infinite disconnect storm. +### Fix E (Tower) — reconnect-path listener hygiene (folded in) -### Instrumentation (targeted, cheap) +`attachShellper` (`pty-session.ts:119-197`) becomes idempotent (remove prior `data`/`exit`/`close` listeners before re-subscribing); the on-the-fly reconnect / `createSessionRaw` path (`pty-manager.ts:161`) tears down any pre-existing PtySession under a reused id before replacing it. Frequent remote reconnects make this matter more than it does locally. -On the existing 30 s SSE heartbeat, log per-session `ringBuffer` partial length and shellper replay byte size, and `WARN` (with terminal id) when a partial exceeds a threshold. Directly observes this cause and confirms the fix holds in production. +### Cap configuration + invariant + +`MAX_PARTIAL_BYTES` (~256 KB) and `ShellperReplayBuffer maxBytes` (~ a few MB) configurable via env for cloud tuning. With Fix C the `< MAX_QUEUE` relationship is no longer correctness-critical, but keep a startup assertion/log if both are set so a misconfig is visible rather than silent. + +### Instrumentation + +On the existing 30 s SSE heartbeat, log per-session `ringBuffer` partial length + shellper replay byte size, and `WARN` (with terminal id) when a partial exceeds a threshold. ### Explicitly out of scope -- Per-frame `fs.writeSync` → async/batched disk log (smaller separate optimization; note for follow-up if profiling after A still shows it). -- The listener-hygiene cleanup (`attachShellper` idempotency, `createSessionRaw` teardown at `pty-manager.ts:161`) — optional; include only if it doesn't bloat the change, else spin to a follow-up issue. +- Per-frame `fs.writeSync` → async/batched disk log (separate optimization; note for follow-up if profiling after A still shows it). - Default `tower stop` shellper-survival (#274/#832/#999/#991); cron `ReferenceError` (#1048); spawn-failure sibling (#1038). ## Files to Change -- `packages/codev/src/terminal/ring-buffer.ts` — `pushData`: scan-only-data + byte-cap `partial`; add `MAX_PARTIAL_BYTES`. -- `packages/codev/src/terminal/shellper-replay-buffer.ts` — add `maxBytes` cap + byte eviction; thread a default (and a bytes option from `shellper-process.ts:97` if needed). -- `packages/vscode/src/terminal-adapter.ts` — backpressure path: backoff/guard + replay-burst handling so it can't infinite-loop (`handleData` ~300-308, `reconnect` ~281-296; possibly complete the `pause`/`resume` replay bracket). -- `packages/codev/src/agent-farm/servers/tower-websocket.ts` — if the `pause`/`resume` replay bracket is adopted, wrap the replay send (`62-75`). -- `packages/codev/src/agent-farm/servers/tower-server.ts:236-262` — partial/replay-size instrumentation on the SSE heartbeat. -- Tests: `packages/codev/src/terminal/__tests__/` and `packages/vscode/` (see Test Plan). -- VSCode CHANGELOG + `docs/releases/UNRELEASED.md` — user-facing entry for the terminal-freeze fix (per repo convention for vscode-affecting changes). +- `packages/codev/src/terminal/ring-buffer.ts` — scan-only `pushData`; byte-cap `partial`; byte-addressable `seq` + `getSince`. `MAX_PARTIAL_BYTES`. +- `packages/codev/src/terminal/shellper-replay-buffer.ts` — `maxBytes` cap + byte eviction; thread default/option from `shellper-process.ts:97`. +- `packages/vscode/src/terminal-adapter.ts` — delta reconnect (`?resume=`), replay-excluded-from-backpressure (`pause`/`resume` bracket), live-overload drop, keep backoff/give-up (`handleData` ~300-321, `reconnect` ~281-296, `connect` ~130-160). +- `packages/codev/src/agent-farm/servers/tower-websocket.ts` — bracket replay with `pause`/`resume`; resume via byte-aware `getSince`; emit byte-aware `seq` (`53-75`). +- `packages/codev/src/terminal/pty-session.ts` — `attachShellper` idempotency (Fix E). +- `packages/codev/src/terminal/pty-manager.ts:161` — `createSessionRaw` teardown of reused-id session (Fix E). +- `packages/codev/src/agent-farm/servers/tower-server.ts:236-262` — instrumentation + cap-config validation. +- Possibly `@cluesmith/codev-types` / `ws-protocol` — only if the `pause`/`resume`/`seq` control shapes need additions (they appear to exist already). +- Tests under `packages/codev/src/terminal/__tests__/`, `packages/codev/src/agent-farm/__tests__/`, and `packages/vscode/` (see Test Plan). +- `packages/vscode/CHANGELOG.md` + `docs/releases/UNRELEASED.md` — user-facing entry. + +This is a large, multi-package change. It is sequenced so commits are independently reviewable (A → B → D → C → E → instrumentation), and the `dev-approval` reviewer can exercise it incrementally. The minimal safety net (Fix C backoff/give-up) means even a partial landing cannot regress into a storm. ## Risks & Alternatives Considered -- **Risk: byte-trimming `partial` corrupts a no-newline TUI replay** (loses early escape state). Mitigation: reconnect drives a full TUI repaint; the accelerated test asserts a *normal* newline stream replays byte-identically (no change for the common case); the reviewer validates a live reconnect at `dev-approval`. -- **Risk: the cap clips a legitimately long single line** (e.g. a 200 KB single-line JSON). Mitigation: `MAX_PARTIAL_BYTES` set well above realistic single-line sizes but under `MAX_QUEUE`; it only bounds pathological multi-MB unbroken runs. -- **Risk: Fix C alone (without A/B) only hides the storm.** Mitigation: A/B remove the oversized replay at the source; C is defense-in-depth. We ship all three so neither a Tower-side nor a client-side regression can resurrect the storm. -- **Alternative — only instrument + bisect v3.1.5..v3.1.7.** Rejected as primary: captured data already pins the mechanism without a 12 h+ soak and bisect lands no fix. Bisect stays a fallback if a post-fix soak still shows growth. -- **Alternative — raise the client `MAX_QUEUE`.** Rejected: it only moves the threshold; a larger no-newline stream still crosses it, and it does nothing for the Tower-side O(n²)/memory. +- **Risk: byte-addressable resume is subtle** (reconstructing the delta across the line/partial boundary; off-by-one re-sends or gaps). Mitigation: unit tests asserting `getSince(seq)` returns exactly the bytes after `seq` for newline and no-newline streams; the bounded-full-replay fallback covers the out-of-window case; the backpressure-exclusion (Fix C) means even a wrong-but-large resume can't storm. +- **Risk: front-trimming `partial` drops early escape state** (e.g. alt-screen enter). Mitigation: reconnect drives a full TUI repaint; replay-correctness test asserts a normal newline stream is byte-identical before/after. +- **Risk: dropping live frames loses output.** Mitigation: only under genuine sustained overload (same regime Tower already drops in); terminal output is ephemeral and repainted; this is strictly better than the freeze it replaces. +- **Risk: PR size / blast radius across core + shellper + vscode.** Mitigation: sequenced commits; each fix is independently correct; behind no feature flag but each is small and locally testable; `dev-approval` gates the running result. +- **Alternative — land A/B/C-minimal now, resume/E as fast-follow.** Considered and set aside per decision to do it all together; the remote-resilience value of resume + clean reconnect path is the point. +- **Alternative — just raise `MAX_QUEUE`.** Rejected: moves the threshold, fixes neither the Tower O(n²)/memory nor the remote bandwidth cost. ## Test Plan -Run from the worktree (`pnpm --filter @cluesmith/codev …` for core; build the VSCode package for C). +Run from the worktree (`pnpm --filter @cluesmith/codev …` for core; build the VSCode package). -- **Unit — CPU bound (core):** feed a synthetic no-newline stream of M bytes / K frames into `RingBuffer.pushData`; assert `partial.length` stays ≤ cap and per-frame cost stays flat as the stream grows. (Today: partial → M, unbounded re-scan.) -- **Unit — replay correctness (no regression):** a normal newline stream yields byte-identical `getAll()`/`getSince()` before and after. -- **Unit — shellper buffer:** zero-newline stream exceeding `maxBytes` keeps `ShellperReplayBuffer.size ≤ maxBytes`; `getReplayData()` returns the bounded tail. -- **Unit/component — client loop:** simulate a replay frame > `MAX_QUEUE` into `terminal-adapter`'s data path; assert it does **not** enter an unbounded instant-reconnect loop (bounded attempts / backoff applied / replay paced), and that the terminal still renders. -- **Build + suites:** core `pnpm --filter @cluesmith/codev build && … test`; build the VSCode extension. -- **Manual / live (at `dev-approval`):** start Tower on this branch; open an architect terminal running Claude's full-screen UI; let it redraw past ~1 MB of no-newline output. Confirm via `afx tower log -f` that the partial size **plateaus at the cap** and Tower CPU stays low, and via the VSCode terminal log that there is **no** connect→backpressure→reconnect storm (contrast with the 14k-cycle capture). Reconnect the tab and confirm the screen repaints correctly. Optionally restart Tower against a still-busy session and confirm CPU does **not** re-saturate. -- **Soak (post-merge / verify, non-gating):** several hours on a real workload; CPU stays flat, no storms in the VSCode log. +- **Unit — RingBuffer:** (a) no-newline stream keeps `partial.length ≤ cap` and per-frame cost flat; (b) normal newline stream replays byte-identically (no regression); (c) `getSince(seq)` returns exactly the post-`seq` bytes for both newline and no-newline streams, with the bounded fallback when out of window. +- **Unit — ShellperReplayBuffer:** zero-newline stream over `maxBytes` keeps `size ≤ maxBytes`; `getReplayData()` returns the bounded tail. +- **Unit/component — client (`terminal-adapter`):** (a) a replay frame > `MAX_QUEUE` renders (paced) and does **not** enter a reconnect loop; (b) reconnect issues `?resume=`; (c) sustained live overload drops/coalesces rather than reconnecting; (d) repeated genuine backpressure backs off and eventually gives up (no busy-loop). +- **Unit — Fix E:** many reconnect/re-attach cycles keep shellper-client listener counts and live PtySession count bounded; a single data frame is processed exactly once. +- **Build + suites:** core build + test green; VSCode extension builds. +- **Manual / live (at `dev-approval`), both hosting modes:** + - *Local:* open an architect terminal running Claude's full-screen UI; drive it past ~1 MB of no-newline output. Confirm via `afx tower log -f` that partial size plateaus at the cap and CPU stays low, and via the VSCode terminal log that there is **no** connect→backpressure→reconnect storm (contrast the 14k-cycle capture). Reconnect the tab and confirm correct repaint and that the reconnect used `resume` (delta, small payload). + - *Remote-ish:* repeat against a Tower reached over the tunnel (or with induced latency/drops); confirm reconnects ship deltas not full buffers, no storm, and the terminal stays usable across forced disconnects. +- **Soak (post-merge / verify, non-gating):** several hours on a real workload; CPU flat, no storms in the VSCode log. -## Open Questions for the Reviewer +## Decisions (confirmed with architect — "do all together") -1. **Cap sizes:** `MAX_PARTIAL_BYTES` ~256 KB (must stay < the client's 1 MB `MAX_QUEUE`) and `ShellperReplayBuffer maxBytes` ~ a few MB. OK, or make them env-configurable? -2. **Fix C shape:** minimal (backoff + re-trip guard) now, or the fuller `pause`/`resume` replay-bracket so replay never counts toward backpressure? The latter is cleaner but touches both client and Tower. -3. **Listener hygiene:** include the optional defensive cleanup here, or split to a follow-up to keep this PR scoped to the storm fix? +1. **Caps:** env-configurable; `MAX_PARTIAL_BYTES` ~256 KB, `ShellperReplayBuffer maxBytes` ~ a few MB; invariant no longer correctness-critical thanks to Fix C, but surfaced if misconfigured. +2. **Fix C:** the full design — delta `resume` reconnect + replay-excluded-from-backpressure + live-overload-drops + retained backoff/give-up. +3. **Listener hygiene (Fix E):** folded into this PR. +4. **Resume/delta reconnect (incl. byte-addressable seq):** included — the remote-resilience centerpiece. diff --git a/codev/state/pir-1047_thread.md b/codev/state/pir-1047_thread.md index f3d7e2440..4a62ed742 100644 --- a/codev/state/pir-1047_thread.md +++ b/codev/state/pir-1047_thread.md @@ -40,3 +40,14 @@ Architect supplied tmp/vscode-log.txt (42,143 lines, 58 min) + screenshot. Storm ### Plan now has 3 coordinated fixes A: RingBuffer.pushData scan-only-data + byte-cap partial (< MAX_QUEUE). B: ShellperReplayBuffer byte cap. C: client backpressure path must not infinite-loop (backoff + re-trip guard; ideally pause/resume bracket so replay isn't counted as live backpressure). Plus instrumentation. Still at plan-approval gate. + +### Scope decision (2026-06-15): "do all together" — remote-resilient full fix +Architect asked for the most resilient approach given Tower may be hosted locally OR remotely (cloud/tunnel). Folded ALL into one coordinated PR: +- A: RingBuffer scan-only + byte-cap partial + **byte-addressable seq/getSince** (so resume works for no-newline streams). +- B: ShellperReplayBuffer byte cap. +- C: client redesign — **delta reconnect (?resume=lastSeq)** + replay excluded from backpressure (pause/resume bracket) + **live overload DROPS not reconnects** (mirror Tower bufferedAmount drop) + keep backoff/give-up safety net. +- D: Tower brackets replay with pause/resume + honors resume via byte-aware getSince. +- E: listener hygiene folded in (attachShellper idempotent, createSessionRaw teardown) — matters more remotely (frequent reconnects). +- Caps env-configurable; Fix C removes the hard MAX_PARTIAL_BYTES Date: Mon, 15 Jun 2026 09:10:56 +1000 Subject: [PATCH 07/31] chore(porch): 1047 plan-approval gate-approved --- .../projects/1047-tower-terminals-architects-bui/status.yaml | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/codev/projects/1047-tower-terminals-architects-bui/status.yaml b/codev/projects/1047-tower-terminals-architects-bui/status.yaml index 773bb42aa..fbfa3f39f 100644 --- a/codev/projects/1047-tower-terminals-architects-bui/status.yaml +++ b/codev/projects/1047-tower-terminals-architects-bui/status.yaml @@ -6,8 +6,9 @@ plan_phases: [] current_plan_phase: null gates: plan-approval: - status: pending + status: approved requested_at: '2026-06-14T22:20:45.222Z' + approved_at: '2026-06-14T23:10:56.668Z' dev-approval: status: pending pr: @@ -16,4 +17,4 @@ iteration: 1 build_complete: false history: [] started_at: '2026-06-14T22:12:29.323Z' -updated_at: '2026-06-14T22:20:45.223Z' +updated_at: '2026-06-14T23:10:56.669Z' From 160f2ac0dbe90ba4f02381a20d4b0a122ffb03d1 Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 09:12:05 +1000 Subject: [PATCH 08/31] chore(porch): 1047 implement phase-transition --- .../projects/1047-tower-terminals-architects-bui/status.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/codev/projects/1047-tower-terminals-architects-bui/status.yaml b/codev/projects/1047-tower-terminals-architects-bui/status.yaml index fbfa3f39f..9de1d56c6 100644 --- a/codev/projects/1047-tower-terminals-architects-bui/status.yaml +++ b/codev/projects/1047-tower-terminals-architects-bui/status.yaml @@ -1,7 +1,7 @@ id: '1047' title: tower-terminals-architects-bui protocol: pir -phase: plan +phase: implement plan_phases: [] current_plan_phase: null gates: @@ -17,4 +17,4 @@ iteration: 1 build_complete: false history: [] started_at: '2026-06-14T22:12:29.323Z' -updated_at: '2026-06-14T23:10:56.669Z' +updated_at: '2026-06-14T23:12:05.218Z' From 8868ac867bab256865755a34971a3a9324f2845d Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 09:16:56 +1000 Subject: [PATCH 09/31] [PIR #1047] Fix A+E: byte-cap RingBuffer.partial (scan-only pushData) + reconnect listener hygiene --- .../terminal/__tests__/ring-buffer.test.ts | 42 +++++++++++++ packages/codev/src/terminal/pty-manager.ts | 37 +++++++++++- packages/codev/src/terminal/pty-session.ts | 12 +++- packages/codev/src/terminal/ring-buffer.ts | 59 +++++++++++++++---- 4 files changed, 134 insertions(+), 16 deletions(-) diff --git a/packages/codev/src/terminal/__tests__/ring-buffer.test.ts b/packages/codev/src/terminal/__tests__/ring-buffer.test.ts index 7bbbd269c..a70badd73 100644 --- a/packages/codev/src/terminal/__tests__/ring-buffer.test.ts +++ b/packages/codev/src/terminal/__tests__/ring-buffer.test.ts @@ -113,6 +113,48 @@ describe('RingBuffer', () => { expect(buf.getAll()).toEqual(['hello', '', 'world']); }); + it('caps the partial to maxPartialBytes for a no-newline stream (Issue #1047)', () => { + const cap = 1024; + const buf = new RingBuffer(10, cap); + // Feed 100 KB with no newline, in 1 KB frames — mimics a full-screen TUI + // that redraws in place and never emits \n. + const frame = 'x'.repeat(1024); + for (let i = 0; i < 100; i++) { + buf.pushData(frame); + } + // No complete lines were ever pushed. + expect(buf.size).toBe(0); + // The partial is bounded by the cap, not the 100 KB total. + const all = buf.getAll(); + expect(all).toHaveLength(1); + expect(all[0].length).toBe(cap); + // It retains the most recent bytes (front-trimmed). + expect(all[0]).toBe('x'.repeat(cap)); + }); + + it('byte-cap retains the tail across a mixed boundary', () => { + const cap = 8; + const buf = new RingBuffer(10, cap); + buf.pushData('done\n'); // completes a line, clears partial + buf.pushData('ABCDEFGHIJ'); // 10 bytes, no newline → trim to last 8 + expect(buf.getAll()).toEqual(['done', 'CDEFGHIJ']); + }); + + it('byte-cap does not clip lines that fit, and still splits correctly', () => { + const cap = 16; + const buf = new RingBuffer(10, cap); + buf.pushData('short\nalsoShort\ntail'); + expect(buf.getAll()).toEqual(['short', 'alsoShort', 'tail']); + }); + + it('a newline after an over-cap run still emits a (trimmed) line', () => { + const cap = 4; + const buf = new RingBuffer(10, cap); + buf.pushData('ABCDEFGH'); // 8 bytes, no newline → partial trimmed to 'EFGH' + buf.pushData('\n'); // completes the (trimmed) line + expect(buf.getAll()).toEqual(['EFGH']); + }); + it('clear resets content but keeps seq', () => { const buf = new RingBuffer(5); buf.push('a'); diff --git a/packages/codev/src/terminal/pty-manager.ts b/packages/codev/src/terminal/pty-manager.ts index dd649e9b3..b568f5d66 100644 --- a/packages/codev/src/terminal/pty-manager.ts +++ b/packages/codev/src/terminal/pty-manager.ts @@ -11,17 +11,36 @@ import { PtySession } from './pty-session.js'; import type { PtySessionConfig, PtySessionInfo } from './pty-session.js'; import { decodeFrame, encodeControl, encodeData } from './ws-protocol.js'; import { defaultSessionOptions, DEFAULT_DISK_LOG_MAX_BYTES } from './index.js'; +import { DEFAULT_MAX_PARTIAL_BYTES } from './ring-buffer.js'; export interface TerminalManagerConfig { workspaceRoot: string; logDir?: string; // Default: /.agent-farm/logs maxSessions?: number; // Default: 50 ringBufferLines?: number; + maxPartialBytes?: number; // Default: DEFAULT_MAX_PARTIAL_BYTES (or CODEV_TERMINAL_MAX_PARTIAL_BYTES) diskLogEnabled?: boolean; diskLogMaxBytes?: number; reconnectTimeoutMs?: number; } +/** + * Resolve the ring-buffer partial byte cap (Issue #1047), allowing operators + * to tune it via `CODEV_TERMINAL_MAX_PARTIAL_BYTES` (useful when Tower is + * hosted remotely and replay bandwidth matters). Falls back to the default + * for an unset or non-positive value. + */ +function resolveMaxPartialBytes(configured?: number): number { + if (configured !== undefined && configured > 0) { + return configured; + } + const fromEnv = parseInt(process.env.CODEV_TERMINAL_MAX_PARTIAL_BYTES || '', 10); + if (Number.isFinite(fromEnv) && fromEnv > 0) { + return fromEnv; + } + return DEFAULT_MAX_PARTIAL_BYTES; +} + export interface CreateTerminalRequest { command?: string; args?: string[]; @@ -49,6 +68,7 @@ export class TerminalManager { logDir: config.logDir ?? path.join(config.workspaceRoot, '.agent-farm', 'logs'), maxSessions: config.maxSessions ?? 50, ringBufferLines: config.ringBufferLines ?? 1000, + maxPartialBytes: resolveMaxPartialBytes(config.maxPartialBytes), diskLogEnabled: config.diskLogEnabled ?? true, diskLogMaxBytes: config.diskLogMaxBytes ?? DEFAULT_DISK_LOG_MAX_BYTES, reconnectTimeoutMs: config.reconnectTimeoutMs ?? 300_000, @@ -87,6 +107,7 @@ export class TerminalManager { label: req.label ?? `terminal-${id.slice(0, 8)}`, logDir: this.config.logDir, ringBufferLines: this.config.ringBufferLines, + maxPartialBytes: this.config.maxPartialBytes, diskLogEnabled: this.config.diskLogEnabled, diskLogMaxBytes: this.config.diskLogMaxBytes, reconnectTimeoutMs: this.config.reconnectTimeoutMs, @@ -120,8 +141,12 @@ export class TerminalManager { * a fresh one. Reconnect-after-restart passes the persisted id so a terminal * keeps its identity across a Tower restart (#991): the client's WebSocket url * (`/ws/terminal/`) stays valid, so the existing reconnect machinery - * re-attaches transparently rather than hitting a dead id. The in-memory - * sessions map is empty at reconcile time, so reusing the id can't collide. + * re-attaches transparently rather than hitting a dead id. At startup reconcile + * the in-memory sessions map is empty so reuse can't collide; on the *live* + * on-the-fly reconnect path (callers guard on a missing session) it normally + * can't either, but if an entry already exists under this id we tear it down + * first (Issue #1047 Fix E) so a replaced session can't keep firing listeners + * on the surviving shellper client. */ createSessionRaw(opts: { label: string; cwd: string; id?: string }): PtySessionInfo { if (this.sessions.size >= this.config.maxSessions) { @@ -129,6 +154,13 @@ export class TerminalManager { } const id = opts.id ?? randomUUID(); + const existing = this.sessions.get(id); + if (existing) { + // Defensive teardown before overwrite: detach the old session from its + // shellper client so its data/exit/close listeners stop processing on + // the client we're about to re-attach to a fresh session. + existing.detachShellper(); + } const { cols, rows } = defaultSessionOptions(); const sessionConfig: PtySessionConfig = { id, @@ -141,6 +173,7 @@ export class TerminalManager { label: opts.label, logDir: this.config.logDir, ringBufferLines: this.config.ringBufferLines, + maxPartialBytes: this.config.maxPartialBytes, diskLogEnabled: this.config.diskLogEnabled, diskLogMaxBytes: this.config.diskLogMaxBytes, reconnectTimeoutMs: this.config.reconnectTimeoutMs, diff --git a/packages/codev/src/terminal/pty-session.ts b/packages/codev/src/terminal/pty-session.ts index 27798c5bd..62ec96258 100644 --- a/packages/codev/src/terminal/pty-session.ts +++ b/packages/codev/src/terminal/pty-session.ts @@ -21,6 +21,7 @@ export interface PtySessionConfig { label: string; logDir: string; // e.g., .agent-farm/logs/ ringBufferLines?: number; // Default: 1000 + maxPartialBytes?: number; // Default: DEFAULT_MAX_PARTIAL_BYTES (ring-buffer partial byte cap, #1047) diskLogEnabled?: boolean; // Default: true diskLogMaxBytes?: number; // Default: 50MB reconnectTimeoutMs?: number; // Default: 300_000 (5 min) @@ -74,7 +75,7 @@ export class PtySession extends EventEmitter { this.cols = config.cols; this.rows = config.rows; this.createdAt = new Date().toISOString(); - this.ringBuffer = new RingBuffer(config.ringBufferLines ?? 1000); + this.ringBuffer = new RingBuffer(config.ringBufferLines ?? 1000, config.maxPartialBytes); this.diskLogEnabled = config.diskLogEnabled ?? true; this.diskLogMaxBytes = config.diskLogMaxBytes ?? 50 * 1024 * 1024; // DEFAULT_DISK_LOG_MAX_BYTES this.reconnectTimeoutMs = config.reconnectTimeoutMs ?? 300_000; @@ -117,6 +118,15 @@ export class PtySession extends EventEmitter { * User input flows: WebSocket → write() → shellper. */ attachShellper(client: IShellperClient, replayData: Buffer, shellperPid: number, shellperSessionId?: string): void { + // Idempotent re-attach (Issue #1047 Fix E): if a previous client is still + // attached, drop our listeners on it before subscribing to the new one so + // a re-attach can't double the per-byte data fan-out (each leaked 'data' + // listener would re-run onPtyData for every PTY byte). + if (this.shellperClient && this.shellperClient !== client) { + this.shellperClient.removeAllListeners('data'); + this.shellperClient.removeAllListeners('exit'); + this.shellperClient.removeAllListeners('close'); + } this._shellperBacked = true; this.shellperClient = client; this.shellperPid = shellperPid; diff --git a/packages/codev/src/terminal/ring-buffer.ts b/packages/codev/src/terminal/ring-buffer.ts index 6b966a9de..05f7c93fd 100644 --- a/packages/codev/src/terminal/ring-buffer.ts +++ b/packages/codev/src/terminal/ring-buffer.ts @@ -3,15 +3,34 @@ * Used for reconnection replay — stores last N lines in memory. */ +/** + * Maximum bytes retained in the incomplete-line `partial` (Issue #1047). + * A full-screen TUI (alternate screen buffer) redraws in place with + * cursor-addressing and carriage returns and emits almost no `\n`, so the + * `partial` would otherwise grow without bound — turning every `pushData` + * into an O(n) re-scan (O(n²) over the session) and producing a multi-MB + * replay payload that overflows the terminal client's backpressure budget. + * 256 KB is far above any realistic single line yet bounds per-frame work, + * memory, and the replay payload. Stays well under the VSCode client's 1 MB + * `MAX_QUEUE` so a replay never trips client backpressure. + */ +export const DEFAULT_MAX_PARTIAL_BYTES = 256 * 1024; + export class RingBuffer { private buffer: string[]; private head: number = 0; private count: number = 0; private seq: number = 0; // monotonically increasing sequence number private partial: string = ''; // incomplete line from previous pushData call + private readonly maxPartialBytes: number; - constructor(private readonly capacity: number = 1000) { + constructor(private readonly capacity: number = 1000, maxPartialBytes: number = DEFAULT_MAX_PARTIAL_BYTES) { this.buffer = new Array(capacity); + if (maxPartialBytes > 0) { + this.maxPartialBytes = maxPartialBytes; + } else { + this.maxPartialBytes = DEFAULT_MAX_PARTIAL_BYTES; + } } /** Push a complete line into the buffer. Returns the assigned sequence number. */ @@ -31,23 +50,37 @@ export class RingBuffer { * chunk boundaries: if data doesn't end with \n, the trailing fragment * is held and prepended to the next pushData call. * + * Scans only the incoming `data` for newlines (never re-splits the whole + * accumulated `partial + data`), so per-call work is O(|data|) rather than + * O(|partial|) — the O(n²) re-scan that pegged Tower's CPU on no-newline + * TUI streams (Issue #1047). The trailing `partial` is byte-capped: an + * over-cap unbroken run (a TUI that never emits `\n`) is front-trimmed to + * the most recent {@link maxPartialBytes}, bounding per-frame work, memory, + * and the replay payload. Front-trimming (rather than injecting a synthetic + * newline) avoids corrupting a TUI replay with spurious line feeds; a + * reconnect drives a full repaint that heals the trimmed prefix. + * * Returns last sequence number. */ pushData(data: string): number { - const combined = this.partial + data; - const parts = combined.split('\n'); - - // Last element is either: - // - "" if data ended with \n (all lines complete) - // - non-empty if data ended mid-line (incomplete line) - // Either way, save it as the new partial. - this.partial = parts.pop()!; + let start = 0; + let nl = data.indexOf('\n'); + while (nl !== -1) { + // Complete line = held partial (if any) + this segment up to the newline. + this.push(this.partial + data.slice(start, nl)); + this.partial = ''; + start = nl + 1; + nl = data.indexOf('\n', start); + } - let lastSeq = this.seq; - for (const line of parts) { - lastSeq = this.push(line); + // Remainder has no newline — append to the partial and bound its size. + if (start < data.length) { + this.partial += data.slice(start); + if (this.partial.length > this.maxPartialBytes) { + this.partial = this.partial.slice(this.partial.length - this.maxPartialBytes); + } } - return lastSeq; + return this.seq; } /** Get all stored lines in order, including any incomplete trailing line. */ From 7d42dac953528f4ee8fe136b20aabdd463cad65d Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 09:19:08 +1000 Subject: [PATCH 10/31] [PIR #1047] Fix B: byte-cap ShellperReplayBuffer (bounds no-newline replay across restart) --- .../__tests__/shellper-replay-buffer.test.ts | 52 +++++++++++ packages/codev/src/terminal/shellper-main.ts | 15 ++++ .../codev/src/terminal/shellper-process.ts | 5 +- .../src/terminal/shellper-replay-buffer.ts | 86 +++++++++++++------ 4 files changed, 128 insertions(+), 30 deletions(-) create mode 100644 packages/codev/src/terminal/__tests__/shellper-replay-buffer.test.ts diff --git a/packages/codev/src/terminal/__tests__/shellper-replay-buffer.test.ts b/packages/codev/src/terminal/__tests__/shellper-replay-buffer.test.ts new file mode 100644 index 000000000..9196e174e --- /dev/null +++ b/packages/codev/src/terminal/__tests__/shellper-replay-buffer.test.ts @@ -0,0 +1,52 @@ +import { describe, it, expect } from 'vitest'; +import { ShellperReplayBuffer } from '../shellper-replay-buffer.js'; + +describe('ShellperReplayBuffer', () => { + it('evicts by line count (existing behavior preserved)', () => { + const buf = new ShellperReplayBuffer(3); + buf.append('a\nb\nc\nd\ne\n'); + // Keeps the last ~3 lines worth of data. + const replay = buf.getReplayData().toString('utf-8'); + expect(replay.endsWith('e\n')).toBe(true); + expect(replay).not.toContain('a\n'); + expect(buf.lines).toBeLessThanOrEqual(3); + }); + + it('bounds a no-newline stream by bytes (Issue #1047)', () => { + const maxBytes = 4096; + const buf = new ShellperReplayBuffer(10_000, maxBytes); + // 100 KB of redraw output with no newline — lineCount stays 0, so the + // line cap never fires; the byte cap must bound it. + const frame = Buffer.alloc(1024, 0x78); // 'x' + for (let i = 0; i < 100; i++) { + buf.append(frame); + } + expect(buf.lines).toBe(0); + expect(buf.size).toBeLessThanOrEqual(maxBytes); + expect(buf.getReplayData().length).toBeLessThanOrEqual(maxBytes); + }); + + it('byte cap retains the most recent bytes', () => { + const maxBytes = 8; + const buf = new ShellperReplayBuffer(10_000, maxBytes); + buf.append('ABCDEFGHIJKLMNOP'); // 16 bytes, no newline + expect(buf.getReplayData().toString('utf-8')).toBe('IJKLMNOP'); + }); + + it('respects both caps when a single chunk overflows lines and bytes', () => { + const buf = new ShellperReplayBuffer(2, 64); + // Many short lines in one append: over the line cap and (eventually) bytes. + const data = Array.from({ length: 50 }, (_, i) => `row${i}`).join('\n') + '\n'; + buf.append(data); + expect(buf.lines).toBeLessThanOrEqual(2); + expect(buf.size).toBeLessThanOrEqual(64); + // Most recent rows survive. + expect(buf.getReplayData().toString('utf-8')).toContain('row49'); + }); + + it('does not evict when under both caps', () => { + const buf = new ShellperReplayBuffer(10_000, 1024 * 1024); + buf.append('hello\nworld\n'); + expect(buf.getReplayData().toString('utf-8')).toBe('hello\nworld\n'); + }); +}); diff --git a/packages/codev/src/terminal/shellper-main.ts b/packages/codev/src/terminal/shellper-main.ts index 3f5acabcc..97e777283 100644 --- a/packages/codev/src/terminal/shellper-main.ts +++ b/packages/codev/src/terminal/shellper-main.ts @@ -21,6 +21,20 @@ import fs from 'node:fs'; import path from 'node:path'; import { createRequire } from 'node:module'; import { ShellperProcess, type IShellperPty, type PtyOptions } from './shellper-process.js'; +import { DEFAULT_MAX_REPLAY_BYTES } from './shellper-replay-buffer.js'; + +/** + * Resolve the shellper replay buffer byte cap (Issue #1047), tunable via + * `CODEV_SHELLPER_MAX_REPLAY_BYTES` for remote-hosted Tower deployments where + * replay payload size affects reconnect bandwidth. + */ +function resolveReplayMaxBytes(): number { + const fromEnv = parseInt(process.env.CODEV_SHELLPER_MAX_REPLAY_BYTES || '', 10); + if (Number.isFinite(fromEnv) && fromEnv > 0) { + return fromEnv; + } + return DEFAULT_MAX_REPLAY_BYTES; +} // createRequire enables importing native/CJS modules (like node-pty) from ESM. // The package uses "type": "module", so bare `require()` is not available. @@ -161,6 +175,7 @@ async function main(): Promise { config.socketPath, config.replayBufferLines ?? 10_000, logStderr, + resolveReplayMaxBytes(), ); logStderr(`Shellper started: pid=${process.pid}, command=${config.command}, socket=${config.socketPath}`); diff --git a/packages/codev/src/terminal/shellper-process.ts b/packages/codev/src/terminal/shellper-process.ts index e1b343455..ad0619900 100644 --- a/packages/codev/src/terminal/shellper-process.ts +++ b/packages/codev/src/terminal/shellper-process.ts @@ -36,7 +36,7 @@ import { type SignalMessage, type SpawnMessage, } from './shellper-protocol.js'; -import { ShellperReplayBuffer } from './shellper-replay-buffer.js'; +import { ShellperReplayBuffer, DEFAULT_MAX_REPLAY_BYTES } from './shellper-replay-buffer.js'; import { DEFAULT_COLS, DEFAULT_ROWS } from './index.js'; // --- IShellperPty: abstraction over node-pty for testing --- @@ -92,9 +92,10 @@ export class ShellperProcess extends EventEmitter { private readonly socketPath: string, replayBufferLines: number = 10_000, private readonly log: (msg: string) => void = () => {}, + replayBufferMaxBytes: number = DEFAULT_MAX_REPLAY_BYTES, ) { super(); - this.replayBuffer = new ShellperReplayBuffer(replayBufferLines); + this.replayBuffer = new ShellperReplayBuffer(replayBufferLines, replayBufferMaxBytes); } /** diff --git a/packages/codev/src/terminal/shellper-replay-buffer.ts b/packages/codev/src/terminal/shellper-replay-buffer.ts index 38fe560d1..aa3be1e44 100644 --- a/packages/codev/src/terminal/shellper-replay-buffer.ts +++ b/packages/codev/src/terminal/shellper-replay-buffer.ts @@ -9,63 +9,93 @@ * process doesn't need to pull in the full package dependency tree. */ +/** + * Default byte ceiling for the replay buffer (Issue #1047). The line-count + * cap alone never bounds a full-screen TUI stream that redraws in place and + * emits no `\n` (lineCount stays 0, so eviction never fires) — the buffer + * grows unbounded and the REPLAY frame it produces overflows the terminal + * client's backpressure budget. This byte cap bounds the buffer (and the + * replay payload) regardless of newline density. A few MB keeps a useful + * scrollback while staying small enough to replay cheaply, including across + * a remotely-hosted Tower reconnect. + */ +export const DEFAULT_MAX_REPLAY_BYTES = 2 * 1024 * 1024; + +function countNewlines(buf: Buffer): number { + let n = 0; + for (let i = 0; i < buf.length; i++) { + if (buf[i] === 0x0a) n++; + } + return n; +} + export class ShellperReplayBuffer { private chunks: Buffer[] = []; private totalBytes = 0; private readonly maxLines: number; + private readonly maxBytes: number; private lineCount = 0; /** * @param maxLines Maximum number of lines to retain. Lines are delimited * by newline characters in the raw data stream. + * @param maxBytes Maximum bytes to retain regardless of newline count. + * Guards against unbounded growth on no-newline TUI streams (#1047). */ - constructor(maxLines: number = 10_000) { + constructor(maxLines: number = 10_000, maxBytes: number = DEFAULT_MAX_REPLAY_BYTES) { this.maxLines = maxLines; + if (maxBytes > 0) { + this.maxBytes = maxBytes; + } else { + this.maxBytes = DEFAULT_MAX_REPLAY_BYTES; + } + } + + private exceedsLimit(): boolean { + return this.lineCount > this.maxLines || this.totalBytes > this.maxBytes; } /** * Append raw PTY output data to the buffer. - * Evicts oldest chunks if the line count exceeds maxLines. + * Evicts oldest data when either the line OR the byte limit is exceeded, so + * a stream with no newlines is still bounded by bytes (#1047). */ append(data: Buffer | string): void { const buf = typeof data === 'string' ? Buffer.from(data, 'utf-8') : data; if (buf.length === 0) return; - // Count newlines in this chunk - let newLines = 0; - for (let i = 0; i < buf.length; i++) { - if (buf[i] === 0x0a) newLines++; - } - this.chunks.push(buf); this.totalBytes += buf.length; - this.lineCount += newLines; + this.lineCount += countNewlines(buf); - // Evict oldest chunks if we've exceeded the line limit - while (this.lineCount > this.maxLines && this.chunks.length > 1) { - const oldest = this.chunks[0]; - let removedLines = 0; - for (let i = 0; i < oldest.length; i++) { - if (oldest[i] === 0x0a) removedLines++; - } - this.chunks.shift(); + // Evict whole oldest chunks while either limit is exceeded. + while (this.exceedsLimit() && this.chunks.length > 1) { + const oldest = this.chunks.shift()!; this.totalBytes -= oldest.length; - this.lineCount -= removedLines; + this.lineCount -= countNewlines(oldest); } - // Handle edge case: single chunk exceeds line limit. - // Trim from the front to keep only the last maxLines lines. - if (this.lineCount > this.maxLines && this.chunks.length === 1) { + // Single remaining chunk still over a limit: trim from the front far enough + // to satisfy BOTH the line and the byte bound. + if (this.exceedsLimit() && this.chunks.length === 1) { const chunk = this.chunks[0]; - let linesToSkip = this.lineCount - this.maxLines; let offset = 0; - while (linesToSkip > 0 && offset < chunk.length) { - if (chunk[offset] === 0x0a) linesToSkip--; - offset++; + if (this.lineCount > this.maxLines) { + let linesToSkip = this.lineCount - this.maxLines; + while (linesToSkip > 0 && offset < chunk.length) { + if (chunk[offset] === 0x0a) linesToSkip--; + offset++; + } + } + const byteOffset = chunk.length - this.maxBytes; + if (byteOffset > offset) { + offset = byteOffset; + } + if (offset > 0) { + this.chunks[0] = chunk.subarray(offset); + this.totalBytes = this.chunks[0].length; + this.lineCount = countNewlines(this.chunks[0]); } - this.chunks[0] = chunk.subarray(offset); - this.totalBytes = this.chunks[0].length; - this.lineCount = this.maxLines; } } From f7e6b3a2979a3dd1dbcb191ba7d7c84b8edfa534 Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 09:24:23 +1000 Subject: [PATCH 11/31] [PIR #1047] Fix C+D: pause/resume replay bracket, drop-not-reconnect on live overload, resume-delta reconnect --- .../src/agent-farm/servers/tower-websocket.ts | 8 +- .../src/__tests__/terminal-adapter.test.ts | 86 +++++++++++++++++++ packages/vscode/src/terminal-adapter.ts | 72 ++++++++++++++-- 3 files changed, 158 insertions(+), 8 deletions(-) diff --git a/packages/codev/src/agent-farm/servers/tower-websocket.ts b/packages/codev/src/agent-farm/servers/tower-websocket.ts index 1c57054cd..d6c6170cf 100644 --- a/packages/codev/src/agent-farm/servers/tower-websocket.ts +++ b/packages/codev/src/agent-farm/servers/tower-websocket.ts @@ -58,11 +58,17 @@ export function handleTerminalWebSocket(ws: WebSocket, session: PtySession, req: replayLines = session.attach(client); } - // Send replay data as binary data frame + // Send replay data as binary data frame, bracketed by pause/resume control + // frames (#1047). The bracket tells the client "this is the one-shot buffer + // snapshot" so it paces the write and excludes it from its live-backpressure + // budget. Without it, a client counts a large replay as live overload and + // (historically) looped forever reconnecting for the same oversized replay. if (replayLines.length > 0) { const replayData = replayLines.join('\n'); if (ws.readyState === WebSocket.OPEN) { + ws.send(encodeControl({ type: 'pause', payload: {} })); ws.send(encodeData(replayData)); + ws.send(encodeControl({ type: 'resume', payload: {} })); } } diff --git a/packages/vscode/src/__tests__/terminal-adapter.test.ts b/packages/vscode/src/__tests__/terminal-adapter.test.ts index 50e057cf5..bc632f17b 100644 --- a/packages/vscode/src/__tests__/terminal-adapter.test.ts +++ b/packages/vscode/src/__tests__/terminal-adapter.test.ts @@ -65,6 +65,23 @@ vi.mock('@cluesmith/codev-core/escape-buffer', () => ({ vi.mock('@cluesmith/codev-types', () => ({ FRAME_CONTROL: 0x00, FRAME_DATA: 0x01 })); +const FRAME_CONTROL = 0x00; +const FRAME_DATA = 0x01; + +/** Deliver a binary DATA frame to the adapter's `message` handler. */ +function sendData(socket: { emit(e: string, ...a: unknown[]): void }, payload: Buffer): void { + socket.emit('message', Buffer.concat([Buffer.from([FRAME_DATA]), payload])); +} + +/** Deliver a binary CONTROL frame (JSON) to the adapter's `message` handler. */ +function sendControl( + socket: { emit(e: string, ...a: unknown[]): void }, + msg: { type: string; payload: Record }, +): void { + const json = Buffer.from(JSON.stringify(msg), 'utf-8'); + socket.emit('message', Buffer.concat([Buffer.from([FRAME_CONTROL]), json])); +} + // Imports AFTER mocks are registered. const WebSocket = (await import('ws')).default as unknown as { instances: Array<{ closed: boolean; emit(e: string, ...a: unknown[]): void }>; @@ -294,3 +311,72 @@ describe('PIR #1001 — reconnect notices overwrite in place and clear on succes expect(writes[0]).toContain(RECONNECT_LINK_TEXT); }); }); + +describe('PIR #1047 — oversized replay storm prevention', () => { + function reconnectTo(pty: unknown, url: string): void { + (pty as { reconnect(u?: string): void }).reconnect(url); + } + + it('renders a >1MB replay (pause-bracketed) without reconnecting', () => { + const { pty, writes } = makeAdapter(); + currentSocket().emit('open'); + const socketsBefore = WebSocket.instances.length; + writes.length = 0; + + // Tower brackets the buffer snapshot: pause → (big replay) → resume. + sendControl(currentSocket(), { type: 'pause', payload: {} }); + sendData(currentSocket(), Buffer.alloc(2 * 1024 * 1024, 0x41)); // 2 MB + sendControl(currentSocket(), { type: 'resume', payload: {} }); + + // No reconnect (the old bug looped here ~14k times), and content rendered. + expect(WebSocket.instances.length).toBe(socketsBefore); + expect(writes.length).toBeGreaterThan(0); + expect(writes.join('')).not.toContain(RECONNECT_LINK_TEXT); // did not give up + void pty; + }); + + it('drops live output over the queue limit instead of reconnecting', () => { + const { writes } = makeAdapter(); + currentSocket().emit('open'); + const socketsBefore = WebSocket.instances.length; + writes.length = 0; + + // Not bracketed → treated as live. The old code called reconnect() here. + sendData(currentSocket(), Buffer.alloc(2 * 1024 * 1024, 0x42)); // 2 MB + + expect(WebSocket.instances.length).toBe(socketsBefore); // dropped, no reconnect + }); + + it('still renders normal-sized live output', () => { + const { writes } = makeAdapter(); + currentSocket().emit('open'); + writes.length = 0; + + sendData(currentSocket(), Buffer.from('hello world', 'utf-8')); + expect(writes.join('')).toContain('hello world'); + }); + + it('reconnect requests a resume delta from the last seq', () => { + const { writes } = makeAdapter(); + currentSocket().emit('open'); + writes.length = 0; + + sendControl(currentSocket(), { type: 'seq', payload: { seq: 42 } }); + currentSocket().emit('close'); + vi.advanceTimersByTime(1000); // scheduled reconnect opens a new socket + + expect((currentSocket() as unknown as { url: string }).url).toContain('resume=42'); + }); + + it('a successor-session reconnect resets seq and does a full (no-resume) replay', () => { + const { pty } = makeAdapter(); + currentSocket().emit('open'); + sendControl(currentSocket(), { type: 'seq', payload: { seq: 42 } }); + + reconnectTo(pty, 'ws://localhost:4100/successor'); + + const url = (currentSocket() as unknown as { url: string }).url; + expect(url).toContain('/successor'); + expect(url).not.toContain('resume'); + }); +}); diff --git a/packages/vscode/src/terminal-adapter.ts b/packages/vscode/src/terminal-adapter.ts index e4579a766..0e274fa88 100644 --- a/packages/vscode/src/terminal-adapter.ts +++ b/packages/vscode/src/terminal-adapter.ts @@ -5,7 +5,8 @@ import { EscapeBuffer } from '@cluesmith/codev-core/escape-buffer'; import { BackoffController, classifyUpgradeError } from '@cluesmith/codev-core/reconnect-policy'; const CHUNK_SIZE = 16384; // 16KB — chunk onDidWrite to avoid CPU spikes -const MAX_QUEUE = 1048576; // 1MB — disconnect if queue exceeds this +const MAX_QUEUE = 1048576; // 1MB — drop live output if the unrendered queue exceeds this +const DROP_WARN_INTERVAL_MS = 5000; // throttle backpressure-drop warnings (#1047) // Reconnect backoff. The exponential curve (1000 * 2^attempt, capped at 30s), // the give-up threshold, and the session-unknown classifier all live in the @@ -46,6 +47,7 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { // until the user happens to manually resize the panel (Bugfix #737). private lastDimensions: { cols: number; rows: number } | null = null; private queuedBytes = 0; + private lastDropWarnAt = 0; private disposed = false; // Reconnect-loop state. The adapter owns reconnection end-to-end (#936) — @@ -127,11 +129,27 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { // ── WebSocket ──────────────────────────────────────────────── + /** + * Build the effective WebSocket URL, requesting a delta replay via + * `?resume=` when we've already received data (#1047). This ships + * only the bytes produced during the disconnect window instead of the whole + * buffer on every reconnect — the dominant win when Tower is hosted remotely + * and reconnects are frequent. A fresh connect (lastSeq 0) gets a full + * (bounded) replay, as does a switch to a successor session whose lastSeq + * was reset. + */ + private connectUrl(): string { + if (this.lastSeq <= 0) { return this.wsUrl; } + const sep = this.wsUrl.includes('?') ? '&' : '?'; + return `${this.wsUrl}${sep}resume=${this.lastSeq}`; + } + private connect(): void { if (this.disposed) { return; } - this.log('INFO', `Connecting to ${this.wsUrl}`); - const socket = new WebSocket(this.wsUrl); + const url = this.connectUrl(); + this.log('INFO', `Connecting to ${url}`); + const socket = new WebSocket(url); this.ws = socket; this.ws.binaryType = 'arraybuffer'; @@ -271,6 +289,11 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { private resetStreamState(): void { this.decoder = new TextDecoder('utf-8', { fatal: false }); this.escapeBuffer = new EscapeBuffer(); + // Clear replay/backpressure state so a connection that dropped mid-replay + // (no `resume` control seen) doesn't leave `replaying` stuck true and + // mis-route the next connection's live data (#1047). + this.replaying = false; + this.queuedBytes = 0; } /** @@ -280,7 +303,13 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { */ reconnect(wsUrl?: string): void { if (this.disposed) { return; } - if (wsUrl) { this.wsUrl = wsUrl; } + if (wsUrl && wsUrl !== this.wsUrl) { + // A successor session (#991) is a different terminal id, so our lastSeq + // doesn't apply — reset it so connectUrl() requests a full (bounded) + // replay against the new session rather than a stale delta (#1047). + this.wsUrl = wsUrl; + this.lastSeq = 0; + } if (this.reconnectTimer) { clearTimeout(this.reconnectTimer); this.reconnectTimer = null; @@ -298,12 +327,28 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { // ── Data handling ──────────────────────────────────────────── private handleData(payload: Buffer): void { - // Backpressure check + // Replay burst (bracketed by Tower's pause→…→resume, #1047): the initial + // buffer snapshot is expected to be large and is delivered once. Pace it + // through the chunked writer and do NOT count it toward the live + // backpressure budget — otherwise a multi-MB replay trips the guard before + // anything renders and the old reconnect-for-replay path looped forever. + if (this.replaying) { + const text = this.decoder.decode(payload, { stream: true }); + const safe = this.escapeBuffer.write(text); + if (safe.length > 0) { this.writeChunked(safe); } + return; + } + + // Live overload: if rendered output falls far enough behind that the queue + // exceeds MAX_QUEUE, DROP this burst rather than reconnecting. Terminal + // output is ephemeral — the app repaints — so dropping is safe, and unlike + // a reconnect it does not re-download the whole buffer (costly remotely) + // or risk an infinite replay storm (#1047). Mirrors Tower's own + // bufferedAmount drop on the send side. this.queuedBytes += payload.length; if (this.queuedBytes > MAX_QUEUE) { - this.log('WARN', 'Backpressure exceeded 1MB — disconnecting for replay'); + this.warnDroppedThrottled(); this.queuedBytes = 0; - this.reconnect(); return; } @@ -320,6 +365,19 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { } } + /** + * Warn (at most once per interval) that live output was dropped under + * backpressure. Throttled so a sustained burst can't spam the log the way + * the old reconnect-storm did (#1047 produced ~14k lines in under an hour). + */ + private warnDroppedThrottled(): void { + const now = Date.now(); + if (now - this.lastDropWarnAt >= DROP_WARN_INTERVAL_MS) { + this.lastDropWarnAt = now; + this.log('WARN', 'Backpressure exceeded 1MB — dropping output (terminal will repaint)'); + } + } + private writeChunked(text: string): void { let offset = 0; const writeNext = (): void => { From 9f282745cd58871da316477315e9980f3fe051ec Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 09:26:21 +1000 Subject: [PATCH 12/31] [PIR #1047] Instrumentation: log terminal ring-buffer partial sizes on a Tower monitor interval --- .../src/agent-farm/servers/tower-server.ts | 30 +++++++++++++++++++ .../terminal/__tests__/ring-buffer.test.ts | 9 ++++++ packages/codev/src/terminal/pty-manager.ts | 14 +++++++++ packages/codev/src/terminal/pty-session.ts | 5 ++++ packages/codev/src/terminal/ring-buffer.ts | 5 ++++ 5 files changed, 63 insertions(+) diff --git a/packages/codev/src/agent-farm/servers/tower-server.ts b/packages/codev/src/agent-farm/servers/tower-server.ts index 808e97133..d7e2b65d2 100644 --- a/packages/codev/src/agent-farm/servers/tower-server.ts +++ b/packages/codev/src/agent-farm/servers/tower-server.ts @@ -63,6 +63,13 @@ const rateLimitCleanupInterval = startRateLimitCleanup(); // Shellper session manager (initialized at startup) let shellperManager: SessionManager | null = null; let shellperCleanupInterval: NodeJS.Timeout | null = null; +let terminalPartialMonitorInterval: NodeJS.Timeout | null = null; + +// Observability for Issue #1047: a session whose ring-buffer partial grows +// toward this size is a no-newline TUI stream — the shape that pegged Tower's +// CPU before the byte cap. Logged periodically; warned when crossed. +const PARTIAL_WARN_BYTES = 200 * 1024; +const TERMINAL_MONITOR_INTERVAL_MS = 60_000; // Parse arguments with Commander const program = new Command() @@ -154,6 +161,7 @@ async function gracefulShutdown(signal: string): Promise { // 4. Stop rate limit cleanup, shellper periodic cleanup, and SSE heartbeat clearInterval(rateLimitCleanupInterval); if (shellperCleanupInterval) clearInterval(shellperCleanupInterval); + if (terminalPartialMonitorInterval) clearInterval(terminalPartialMonitorInterval); clearInterval(sseHeartbeatInterval); // 4b. Flush and stop send buffer (Spec 403) — delivers any deferred messages @@ -391,6 +399,28 @@ server.listen(port, bindHost, async () => { getKnownWorkspacePaths, }); + // Issue #1047 observability: periodically report terminal ring-buffer + // partial sizes so a no-newline TUI stream (the freeze trigger) is visible + // in the Tower log. Cheap (O(sessions) once a minute) and .unref()'d so it + // never holds the process open. + terminalPartialMonitorInterval = setInterval(() => { + try { + const partials = getTerminalManager().inspectPartials(); + if (partials.length === 0) return; + let maxBytes = 0; + for (const p of partials) { + if (p.partialBytes > maxBytes) maxBytes = p.partialBytes; + if (p.partialBytes >= PARTIAL_WARN_BYTES) { + log('WARN', `Terminal ${p.id} (${p.label}) ring-buffer partial at ${Math.round(p.partialBytes / 1024)} KB — no-newline TUI stream, capped (#1047)`); + } + } + log('INFO', `Terminal partial monitor: ${partials.length} session(s), max partial ${Math.round(maxBytes / 1024)} KB`); + } catch (err) { + log('ERROR', `Terminal partial monitor failed: ${(err as Error).message}`); + } + }, TERMINAL_MONITOR_INTERVAL_MS); + terminalPartialMonitorInterval.unref(); + // Spec 403: Start send buffer for typing-aware message delivery startSendBuffer(log); diff --git a/packages/codev/src/terminal/__tests__/ring-buffer.test.ts b/packages/codev/src/terminal/__tests__/ring-buffer.test.ts index a70badd73..726164f9b 100644 --- a/packages/codev/src/terminal/__tests__/ring-buffer.test.ts +++ b/packages/codev/src/terminal/__tests__/ring-buffer.test.ts @@ -155,6 +155,15 @@ describe('RingBuffer', () => { expect(buf.getAll()).toEqual(['EFGH']); }); + it('partialBytes reports the held incomplete-line size', () => { + const buf = new RingBuffer(10, 1024); + expect(buf.partialBytes).toBe(0); + buf.pushData('abc'); + expect(buf.partialBytes).toBe(3); + buf.pushData('def\n'); // completes the line, clears partial + expect(buf.partialBytes).toBe(0); + }); + it('clear resets content but keeps seq', () => { const buf = new RingBuffer(5); buf.push('a'); diff --git a/packages/codev/src/terminal/pty-manager.ts b/packages/codev/src/terminal/pty-manager.ts index b568f5d66..74a6628dc 100644 --- a/packages/codev/src/terminal/pty-manager.ts +++ b/packages/codev/src/terminal/pty-manager.ts @@ -200,6 +200,20 @@ export class TerminalManager { return Array.from(this.sessions.values()).map(s => s.info); } + /** + * Snapshot of each session's ring-buffer partial size and client count + * (Issue #1047 observability). A partial creeping toward the cap flags a + * no-newline TUI stream — the shape that used to peg Tower's CPU. + */ + inspectPartials(): Array<{ id: string; label: string; partialBytes: number; clients: number }> { + return Array.from(this.sessions.values()).map(s => ({ + id: s.id, + label: s.label, + partialBytes: s.partialBytes, + clients: s.clientCount, + })); + } + /** Get a session by ID. */ getSession(id: string): PtySession | undefined { return this.sessions.get(id); diff --git a/packages/codev/src/terminal/pty-session.ts b/packages/codev/src/terminal/pty-session.ts index 62ec96258..040501654 100644 --- a/packages/codev/src/terminal/pty-session.ts +++ b/packages/codev/src/terminal/pty-session.ts @@ -420,6 +420,11 @@ export class PtySession extends EventEmitter { return this.clients.size; } + /** Bytes held in the ring buffer's incomplete-line partial (observability, #1047). */ + get partialBytes(): number { + return this.ringBuffer.partialBytes; + } + /** Record that a user sent input to this session. */ recordUserInput(): void { this._lastInputAt = Date.now(); diff --git a/packages/codev/src/terminal/ring-buffer.ts b/packages/codev/src/terminal/ring-buffer.ts index 05f7c93fd..677b5d4ea 100644 --- a/packages/codev/src/terminal/ring-buffer.ts +++ b/packages/codev/src/terminal/ring-buffer.ts @@ -123,6 +123,11 @@ export class RingBuffer { return this.count; } + /** Bytes held in the incomplete-line partial (observability, #1047). */ + get partialBytes(): number { + return this.partial.length; + } + /** Clear the buffer and release memory. */ clear(): void { this.buffer = []; From 757a75a55fe2262b8e359c3f18361368aa269d65 Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 09:30:40 +1000 Subject: [PATCH 13/31] [PIR #1047] Update tower-websocket test for pause/resume replay bracket --- .../agent-farm/__tests__/tower-websocket.test.ts | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/packages/codev/src/agent-farm/__tests__/tower-websocket.test.ts b/packages/codev/src/agent-farm/__tests__/tower-websocket.test.ts index 7d7152eea..d43357624 100644 --- a/packages/codev/src/agent-farm/__tests__/tower-websocket.test.ts +++ b/packages/codev/src/agent-farm/__tests__/tower-websocket.test.ts @@ -128,7 +128,7 @@ describe('tower-websocket', () => { expect(session.attach).not.toHaveBeenCalled(); }); - it('sends replay data and seq frame on connect', () => { + it('brackets replay with pause/resume and sends a seq frame on connect (#1047)', () => { const ws = makeWs(); const session = makeSession(5); session.attach.mockReturnValue(['line1', 'line2']); @@ -136,8 +136,16 @@ describe('tower-websocket', () => { handleTerminalWebSocket(ws, session, req); - // Should send replay data + seq control frame - expect(ws.send).toHaveBeenCalledTimes(2); + // pause control + replay data + resume control + seq control = 4 frames. + // The pause/resume bracket lets the client exclude the (potentially large) + // replay snapshot from its live-backpressure budget (#1047). + expect(ws.send).toHaveBeenCalledTimes(4); + + const controlTypes = ws.send.mock.calls + .map((call: unknown[]) => call[0] as Buffer) + .filter((buf: Buffer) => buf[0] === 0x00) + .map((buf: Buffer) => JSON.parse(buf.subarray(1).toString('utf-8')).type); + expect(controlTypes).toEqual(['pause', 'resume', 'seq']); }); it('sends only seq frame when no replay lines', () => { From 2f5e6351983a0ff04c6d4ef87205377b7aa00643 Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 09:32:26 +1000 Subject: [PATCH 14/31] [PIR #1047] docs: user-facing CHANGELOG entry for terminal-freeze fix --- codev/state/pir-1047_thread.md | 15 +++++++++++++++ packages/vscode/CHANGELOG.md | 6 ++++++ 2 files changed, 21 insertions(+) diff --git a/codev/state/pir-1047_thread.md b/codev/state/pir-1047_thread.md index 4a62ed742..43f579ffa 100644 --- a/codev/state/pir-1047_thread.md +++ b/codev/state/pir-1047_thread.md @@ -51,3 +51,18 @@ Architect asked for the most resilient approach given Tower may be hosted locall - Caps env-configurable; Fix C removes the hard MAX_PARTIAL_BYTES Date: Mon, 15 Jun 2026 09:33:49 +1000 Subject: [PATCH 15/31] chore(porch): 1047 dev-approval gate-requested --- codev/projects/1047-tower-terminals-architects-bui/status.yaml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/codev/projects/1047-tower-terminals-architects-bui/status.yaml b/codev/projects/1047-tower-terminals-architects-bui/status.yaml index 9de1d56c6..ed92eb8d3 100644 --- a/codev/projects/1047-tower-terminals-architects-bui/status.yaml +++ b/codev/projects/1047-tower-terminals-architects-bui/status.yaml @@ -11,10 +11,11 @@ gates: approved_at: '2026-06-14T23:10:56.668Z' dev-approval: status: pending + requested_at: '2026-06-14T23:33:49.007Z' pr: status: pending iteration: 1 build_complete: false history: [] started_at: '2026-06-14T22:12:29.323Z' -updated_at: '2026-06-14T23:12:05.218Z' +updated_at: '2026-06-14T23:33:49.008Z' From b05b73f22dbc80a311668e81536f420002e92dfd Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 11:18:57 +1000 Subject: [PATCH 16/31] [PIR #1047] Hoist replay-cap resolution out of constructor args; document the three distinct size limits --- packages/codev/src/terminal/shellper-main.ts | 6 ++++-- packages/codev/src/terminal/shellper-replay-buffer.ts | 9 +++++++++ 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/packages/codev/src/terminal/shellper-main.ts b/packages/codev/src/terminal/shellper-main.ts index 97e777283..bdbd76818 100644 --- a/packages/codev/src/terminal/shellper-main.ts +++ b/packages/codev/src/terminal/shellper-main.ts @@ -170,12 +170,14 @@ async function main(): Promise { // File doesn't exist — fine } + const replayBufferLines = config.replayBufferLines ?? 10_000; + const replayBufferMaxBytes = resolveReplayMaxBytes(); const shellper = new ShellperProcess( createRealPty, config.socketPath, - config.replayBufferLines ?? 10_000, + replayBufferLines, logStderr, - resolveReplayMaxBytes(), + replayBufferMaxBytes, ); logStderr(`Shellper started: pid=${process.pid}, command=${config.command}, socket=${config.socketPath}`); diff --git a/packages/codev/src/terminal/shellper-replay-buffer.ts b/packages/codev/src/terminal/shellper-replay-buffer.ts index aa3be1e44..3ad73f693 100644 --- a/packages/codev/src/terminal/shellper-replay-buffer.ts +++ b/packages/codev/src/terminal/shellper-replay-buffer.ts @@ -18,6 +18,15 @@ * replay payload) regardless of newline density. A few MB keeps a useful * scrollback while staying small enough to replay cheaply, including across * a remotely-hosted Tower reconnect. + * + * Deliberately a *distinct* limit from the Tower ring buffer's + * `DEFAULT_MAX_PARTIAL_BYTES` (256 KB) and the VSCode client's `MAX_QUEUE` + * (1 MB) — different layers, not a value to unify: the shellper retains + * scrollback for replay, Tower re-caps its own partial on ingest (`pushData`), + * and the client bounds unrendered output. It lives here rather than in a + * shared module because this file is dependency-free so the shellper process + * stays lean (see the header); importing from the `index.ts` barrel would pull + * in the full terminal tree (node-pty, ws). */ export const DEFAULT_MAX_REPLAY_BYTES = 2 * 1024 * 1024; From 02ff5546fd55834b2a50e23afbf0072c2688276f Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 12:00:44 +1000 Subject: [PATCH 17/31] [PIR #1047] Remove byte caps: keep partial/replay whole for faithful TUI replay MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The byte caps (new in this PR) front-trimmed the alt-screen stream, dropping the screen-setup prefix and blanking terminals until a manual resize. They were never needed for the freeze fix (scan-only pushData fixes CPU; the pause/resume bracket + drop-not-reconnect fix the storm) — they only bounded memory/replay size, which #1047 rated minor. Reverts partial/replay to main's faithful behavior; keeps scan-only, the bracket, drop-not-reconnect, resume-delta, listener hygiene, and the (now growth-observability) partial monitor. --- codev/state/pir-1047_thread.md | 9 ++ .../src/agent-farm/servers/tower-server.ts | 12 ++- .../terminal/__tests__/ring-buffer.test.ts | 44 ++------- .../__tests__/shellper-replay-buffer.test.ts | 52 ---------- packages/codev/src/terminal/pty-manager.ts | 27 +----- packages/codev/src/terminal/pty-session.ts | 3 +- packages/codev/src/terminal/ring-buffer.ts | 37 ++------ packages/codev/src/terminal/shellper-main.ts | 19 +--- .../codev/src/terminal/shellper-process.ts | 5 +- .../src/terminal/shellper-replay-buffer.ts | 95 ++++++------------- packages/vscode/CHANGELOG.md | 2 +- 11 files changed, 68 insertions(+), 237 deletions(-) delete mode 100644 packages/codev/src/terminal/__tests__/shellper-replay-buffer.test.ts diff --git a/codev/state/pir-1047_thread.md b/codev/state/pir-1047_thread.md index 43f579ffa..79f288564 100644 --- a/codev/state/pir-1047_thread.md +++ b/codev/state/pir-1047_thread.md @@ -66,3 +66,12 @@ Committed in order: DEVIATION from plan (byte-addressable seq): kept seq LINE-based. The byte-cap already bounds replay to <=256KB ( { for (const p of partials) { if (p.partialBytes > maxBytes) maxBytes = p.partialBytes; if (p.partialBytes >= PARTIAL_WARN_BYTES) { - log('WARN', `Terminal ${p.id} (${p.label}) ring-buffer partial at ${Math.round(p.partialBytes / 1024)} KB — no-newline TUI stream, capped (#1047)`); + log('WARN', `Terminal ${p.id} (${p.label}) ring-buffer partial at ${Math.round(p.partialBytes / 1024)} KB — large no-newline TUI stream growing unbounded (#1047)`); } } log('INFO', `Terminal partial monitor: ${partials.length} session(s), max partial ${Math.round(maxBytes / 1024)} KB`); diff --git a/packages/codev/src/terminal/__tests__/ring-buffer.test.ts b/packages/codev/src/terminal/__tests__/ring-buffer.test.ts index 726164f9b..d2e226a6d 100644 --- a/packages/codev/src/terminal/__tests__/ring-buffer.test.ts +++ b/packages/codev/src/terminal/__tests__/ring-buffer.test.ts @@ -113,50 +113,24 @@ describe('RingBuffer', () => { expect(buf.getAll()).toEqual(['hello', '', 'world']); }); - it('caps the partial to maxPartialBytes for a no-newline stream (Issue #1047)', () => { - const cap = 1024; - const buf = new RingBuffer(10, cap); - // Feed 100 KB with no newline, in 1 KB frames — mimics a full-screen TUI - // that redraws in place and never emits \n. + it('keeps a no-newline stream whole for faithful replay (Issue #1047)', () => { + const buf = new RingBuffer(10); + // 100 KB with no newline, in 1 KB frames — mimics a full-screen TUI that + // redraws in place and never emits \n. The whole stream must be preserved + // (not truncated) so a reconnection replay can reconstruct the screen. const frame = 'x'.repeat(1024); for (let i = 0; i < 100; i++) { buf.pushData(frame); } - // No complete lines were ever pushed. - expect(buf.size).toBe(0); - // The partial is bounded by the cap, not the 100 KB total. + expect(buf.size).toBe(0); // no complete lines const all = buf.getAll(); expect(all).toHaveLength(1); - expect(all[0].length).toBe(cap); - // It retains the most recent bytes (front-trimmed). - expect(all[0]).toBe('x'.repeat(cap)); - }); - - it('byte-cap retains the tail across a mixed boundary', () => { - const cap = 8; - const buf = new RingBuffer(10, cap); - buf.pushData('done\n'); // completes a line, clears partial - buf.pushData('ABCDEFGHIJ'); // 10 bytes, no newline → trim to last 8 - expect(buf.getAll()).toEqual(['done', 'CDEFGHIJ']); - }); - - it('byte-cap does not clip lines that fit, and still splits correctly', () => { - const cap = 16; - const buf = new RingBuffer(10, cap); - buf.pushData('short\nalsoShort\ntail'); - expect(buf.getAll()).toEqual(['short', 'alsoShort', 'tail']); - }); - - it('a newline after an over-cap run still emits a (trimmed) line', () => { - const cap = 4; - const buf = new RingBuffer(10, cap); - buf.pushData('ABCDEFGH'); // 8 bytes, no newline → partial trimmed to 'EFGH' - buf.pushData('\n'); // completes the (trimmed) line - expect(buf.getAll()).toEqual(['EFGH']); + expect(all[0].length).toBe(100 * 1024); // full content retained, not capped + expect(buf.partialBytes).toBe(100 * 1024); }); it('partialBytes reports the held incomplete-line size', () => { - const buf = new RingBuffer(10, 1024); + const buf = new RingBuffer(10); expect(buf.partialBytes).toBe(0); buf.pushData('abc'); expect(buf.partialBytes).toBe(3); diff --git a/packages/codev/src/terminal/__tests__/shellper-replay-buffer.test.ts b/packages/codev/src/terminal/__tests__/shellper-replay-buffer.test.ts deleted file mode 100644 index 9196e174e..000000000 --- a/packages/codev/src/terminal/__tests__/shellper-replay-buffer.test.ts +++ /dev/null @@ -1,52 +0,0 @@ -import { describe, it, expect } from 'vitest'; -import { ShellperReplayBuffer } from '../shellper-replay-buffer.js'; - -describe('ShellperReplayBuffer', () => { - it('evicts by line count (existing behavior preserved)', () => { - const buf = new ShellperReplayBuffer(3); - buf.append('a\nb\nc\nd\ne\n'); - // Keeps the last ~3 lines worth of data. - const replay = buf.getReplayData().toString('utf-8'); - expect(replay.endsWith('e\n')).toBe(true); - expect(replay).not.toContain('a\n'); - expect(buf.lines).toBeLessThanOrEqual(3); - }); - - it('bounds a no-newline stream by bytes (Issue #1047)', () => { - const maxBytes = 4096; - const buf = new ShellperReplayBuffer(10_000, maxBytes); - // 100 KB of redraw output with no newline — lineCount stays 0, so the - // line cap never fires; the byte cap must bound it. - const frame = Buffer.alloc(1024, 0x78); // 'x' - for (let i = 0; i < 100; i++) { - buf.append(frame); - } - expect(buf.lines).toBe(0); - expect(buf.size).toBeLessThanOrEqual(maxBytes); - expect(buf.getReplayData().length).toBeLessThanOrEqual(maxBytes); - }); - - it('byte cap retains the most recent bytes', () => { - const maxBytes = 8; - const buf = new ShellperReplayBuffer(10_000, maxBytes); - buf.append('ABCDEFGHIJKLMNOP'); // 16 bytes, no newline - expect(buf.getReplayData().toString('utf-8')).toBe('IJKLMNOP'); - }); - - it('respects both caps when a single chunk overflows lines and bytes', () => { - const buf = new ShellperReplayBuffer(2, 64); - // Many short lines in one append: over the line cap and (eventually) bytes. - const data = Array.from({ length: 50 }, (_, i) => `row${i}`).join('\n') + '\n'; - buf.append(data); - expect(buf.lines).toBeLessThanOrEqual(2); - expect(buf.size).toBeLessThanOrEqual(64); - // Most recent rows survive. - expect(buf.getReplayData().toString('utf-8')).toContain('row49'); - }); - - it('does not evict when under both caps', () => { - const buf = new ShellperReplayBuffer(10_000, 1024 * 1024); - buf.append('hello\nworld\n'); - expect(buf.getReplayData().toString('utf-8')).toBe('hello\nworld\n'); - }); -}); diff --git a/packages/codev/src/terminal/pty-manager.ts b/packages/codev/src/terminal/pty-manager.ts index 74a6628dc..dcfeeadb3 100644 --- a/packages/codev/src/terminal/pty-manager.ts +++ b/packages/codev/src/terminal/pty-manager.ts @@ -11,36 +11,17 @@ import { PtySession } from './pty-session.js'; import type { PtySessionConfig, PtySessionInfo } from './pty-session.js'; import { decodeFrame, encodeControl, encodeData } from './ws-protocol.js'; import { defaultSessionOptions, DEFAULT_DISK_LOG_MAX_BYTES } from './index.js'; -import { DEFAULT_MAX_PARTIAL_BYTES } from './ring-buffer.js'; export interface TerminalManagerConfig { workspaceRoot: string; logDir?: string; // Default: /.agent-farm/logs maxSessions?: number; // Default: 50 ringBufferLines?: number; - maxPartialBytes?: number; // Default: DEFAULT_MAX_PARTIAL_BYTES (or CODEV_TERMINAL_MAX_PARTIAL_BYTES) diskLogEnabled?: boolean; diskLogMaxBytes?: number; reconnectTimeoutMs?: number; } -/** - * Resolve the ring-buffer partial byte cap (Issue #1047), allowing operators - * to tune it via `CODEV_TERMINAL_MAX_PARTIAL_BYTES` (useful when Tower is - * hosted remotely and replay bandwidth matters). Falls back to the default - * for an unset or non-positive value. - */ -function resolveMaxPartialBytes(configured?: number): number { - if (configured !== undefined && configured > 0) { - return configured; - } - const fromEnv = parseInt(process.env.CODEV_TERMINAL_MAX_PARTIAL_BYTES || '', 10); - if (Number.isFinite(fromEnv) && fromEnv > 0) { - return fromEnv; - } - return DEFAULT_MAX_PARTIAL_BYTES; -} - export interface CreateTerminalRequest { command?: string; args?: string[]; @@ -68,7 +49,6 @@ export class TerminalManager { logDir: config.logDir ?? path.join(config.workspaceRoot, '.agent-farm', 'logs'), maxSessions: config.maxSessions ?? 50, ringBufferLines: config.ringBufferLines ?? 1000, - maxPartialBytes: resolveMaxPartialBytes(config.maxPartialBytes), diskLogEnabled: config.diskLogEnabled ?? true, diskLogMaxBytes: config.diskLogMaxBytes ?? DEFAULT_DISK_LOG_MAX_BYTES, reconnectTimeoutMs: config.reconnectTimeoutMs ?? 300_000, @@ -107,7 +87,6 @@ export class TerminalManager { label: req.label ?? `terminal-${id.slice(0, 8)}`, logDir: this.config.logDir, ringBufferLines: this.config.ringBufferLines, - maxPartialBytes: this.config.maxPartialBytes, diskLogEnabled: this.config.diskLogEnabled, diskLogMaxBytes: this.config.diskLogMaxBytes, reconnectTimeoutMs: this.config.reconnectTimeoutMs, @@ -173,7 +152,6 @@ export class TerminalManager { label: opts.label, logDir: this.config.logDir, ringBufferLines: this.config.ringBufferLines, - maxPartialBytes: this.config.maxPartialBytes, diskLogEnabled: this.config.diskLogEnabled, diskLogMaxBytes: this.config.diskLogMaxBytes, reconnectTimeoutMs: this.config.reconnectTimeoutMs, @@ -202,8 +180,9 @@ export class TerminalManager { /** * Snapshot of each session's ring-buffer partial size and client count - * (Issue #1047 observability). A partial creeping toward the cap flags a - * no-newline TUI stream — the shape that used to peg Tower's CPU. + * (Issue #1047 observability). The partial is unbounded (kept whole for + * faithful replay); a large, growing partial flags a no-newline full-screen + * TUI stream, so this surfaces memory growth if it ever becomes a concern. */ inspectPartials(): Array<{ id: string; label: string; partialBytes: number; clients: number }> { return Array.from(this.sessions.values()).map(s => ({ diff --git a/packages/codev/src/terminal/pty-session.ts b/packages/codev/src/terminal/pty-session.ts index 040501654..f282ee8d9 100644 --- a/packages/codev/src/terminal/pty-session.ts +++ b/packages/codev/src/terminal/pty-session.ts @@ -21,7 +21,6 @@ export interface PtySessionConfig { label: string; logDir: string; // e.g., .agent-farm/logs/ ringBufferLines?: number; // Default: 1000 - maxPartialBytes?: number; // Default: DEFAULT_MAX_PARTIAL_BYTES (ring-buffer partial byte cap, #1047) diskLogEnabled?: boolean; // Default: true diskLogMaxBytes?: number; // Default: 50MB reconnectTimeoutMs?: number; // Default: 300_000 (5 min) @@ -75,7 +74,7 @@ export class PtySession extends EventEmitter { this.cols = config.cols; this.rows = config.rows; this.createdAt = new Date().toISOString(); - this.ringBuffer = new RingBuffer(config.ringBufferLines ?? 1000, config.maxPartialBytes); + this.ringBuffer = new RingBuffer(config.ringBufferLines ?? 1000); this.diskLogEnabled = config.diskLogEnabled ?? true; this.diskLogMaxBytes = config.diskLogMaxBytes ?? 50 * 1024 * 1024; // DEFAULT_DISK_LOG_MAX_BYTES this.reconnectTimeoutMs = config.reconnectTimeoutMs ?? 300_000; diff --git a/packages/codev/src/terminal/ring-buffer.ts b/packages/codev/src/terminal/ring-buffer.ts index 677b5d4ea..b42746719 100644 --- a/packages/codev/src/terminal/ring-buffer.ts +++ b/packages/codev/src/terminal/ring-buffer.ts @@ -3,34 +3,15 @@ * Used for reconnection replay — stores last N lines in memory. */ -/** - * Maximum bytes retained in the incomplete-line `partial` (Issue #1047). - * A full-screen TUI (alternate screen buffer) redraws in place with - * cursor-addressing and carriage returns and emits almost no `\n`, so the - * `partial` would otherwise grow without bound — turning every `pushData` - * into an O(n) re-scan (O(n²) over the session) and producing a multi-MB - * replay payload that overflows the terminal client's backpressure budget. - * 256 KB is far above any realistic single line yet bounds per-frame work, - * memory, and the replay payload. Stays well under the VSCode client's 1 MB - * `MAX_QUEUE` so a replay never trips client backpressure. - */ -export const DEFAULT_MAX_PARTIAL_BYTES = 256 * 1024; - export class RingBuffer { private buffer: string[]; private head: number = 0; private count: number = 0; private seq: number = 0; // monotonically increasing sequence number private partial: string = ''; // incomplete line from previous pushData call - private readonly maxPartialBytes: number; - constructor(private readonly capacity: number = 1000, maxPartialBytes: number = DEFAULT_MAX_PARTIAL_BYTES) { + constructor(private readonly capacity: number = 1000) { this.buffer = new Array(capacity); - if (maxPartialBytes > 0) { - this.maxPartialBytes = maxPartialBytes; - } else { - this.maxPartialBytes = DEFAULT_MAX_PARTIAL_BYTES; - } } /** Push a complete line into the buffer. Returns the assigned sequence number. */ @@ -53,12 +34,11 @@ export class RingBuffer { * Scans only the incoming `data` for newlines (never re-splits the whole * accumulated `partial + data`), so per-call work is O(|data|) rather than * O(|partial|) — the O(n²) re-scan that pegged Tower's CPU on no-newline - * TUI streams (Issue #1047). The trailing `partial` is byte-capped: an - * over-cap unbroken run (a TUI that never emits `\n`) is front-trimmed to - * the most recent {@link maxPartialBytes}, bounding per-frame work, memory, - * and the replay payload. Front-trimming (rather than injecting a synthetic - * newline) avoids corrupting a TUI replay with spurious line feeds; a - * reconnect drives a full repaint that heals the trimmed prefix. + * full-screen-TUI streams (Issue #1047). The `partial` is kept whole and + * unbounded so a reconnection replay faithfully reconstructs the screen: a + * TUI in the alternate screen buffer encodes its state in the cumulative + * byte stream from the alt-screen-enter onward, so truncating the front + * would corrupt the replay (the app won't repaint on a same-size reconnect). * * Returns last sequence number. */ @@ -73,12 +53,9 @@ export class RingBuffer { nl = data.indexOf('\n', start); } - // Remainder has no newline — append to the partial and bound its size. + // Remainder has no newline — append to the partial (cons-string, O(|data|)). if (start < data.length) { this.partial += data.slice(start); - if (this.partial.length > this.maxPartialBytes) { - this.partial = this.partial.slice(this.partial.length - this.maxPartialBytes); - } } return this.seq; } diff --git a/packages/codev/src/terminal/shellper-main.ts b/packages/codev/src/terminal/shellper-main.ts index bdbd76818..3f5acabcc 100644 --- a/packages/codev/src/terminal/shellper-main.ts +++ b/packages/codev/src/terminal/shellper-main.ts @@ -21,20 +21,6 @@ import fs from 'node:fs'; import path from 'node:path'; import { createRequire } from 'node:module'; import { ShellperProcess, type IShellperPty, type PtyOptions } from './shellper-process.js'; -import { DEFAULT_MAX_REPLAY_BYTES } from './shellper-replay-buffer.js'; - -/** - * Resolve the shellper replay buffer byte cap (Issue #1047), tunable via - * `CODEV_SHELLPER_MAX_REPLAY_BYTES` for remote-hosted Tower deployments where - * replay payload size affects reconnect bandwidth. - */ -function resolveReplayMaxBytes(): number { - const fromEnv = parseInt(process.env.CODEV_SHELLPER_MAX_REPLAY_BYTES || '', 10); - if (Number.isFinite(fromEnv) && fromEnv > 0) { - return fromEnv; - } - return DEFAULT_MAX_REPLAY_BYTES; -} // createRequire enables importing native/CJS modules (like node-pty) from ESM. // The package uses "type": "module", so bare `require()` is not available. @@ -170,14 +156,11 @@ async function main(): Promise { // File doesn't exist — fine } - const replayBufferLines = config.replayBufferLines ?? 10_000; - const replayBufferMaxBytes = resolveReplayMaxBytes(); const shellper = new ShellperProcess( createRealPty, config.socketPath, - replayBufferLines, + config.replayBufferLines ?? 10_000, logStderr, - replayBufferMaxBytes, ); logStderr(`Shellper started: pid=${process.pid}, command=${config.command}, socket=${config.socketPath}`); diff --git a/packages/codev/src/terminal/shellper-process.ts b/packages/codev/src/terminal/shellper-process.ts index ad0619900..e1b343455 100644 --- a/packages/codev/src/terminal/shellper-process.ts +++ b/packages/codev/src/terminal/shellper-process.ts @@ -36,7 +36,7 @@ import { type SignalMessage, type SpawnMessage, } from './shellper-protocol.js'; -import { ShellperReplayBuffer, DEFAULT_MAX_REPLAY_BYTES } from './shellper-replay-buffer.js'; +import { ShellperReplayBuffer } from './shellper-replay-buffer.js'; import { DEFAULT_COLS, DEFAULT_ROWS } from './index.js'; // --- IShellperPty: abstraction over node-pty for testing --- @@ -92,10 +92,9 @@ export class ShellperProcess extends EventEmitter { private readonly socketPath: string, replayBufferLines: number = 10_000, private readonly log: (msg: string) => void = () => {}, - replayBufferMaxBytes: number = DEFAULT_MAX_REPLAY_BYTES, ) { super(); - this.replayBuffer = new ShellperReplayBuffer(replayBufferLines, replayBufferMaxBytes); + this.replayBuffer = new ShellperReplayBuffer(replayBufferLines); } /** diff --git a/packages/codev/src/terminal/shellper-replay-buffer.ts b/packages/codev/src/terminal/shellper-replay-buffer.ts index 3ad73f693..38fe560d1 100644 --- a/packages/codev/src/terminal/shellper-replay-buffer.ts +++ b/packages/codev/src/terminal/shellper-replay-buffer.ts @@ -9,102 +9,63 @@ * process doesn't need to pull in the full package dependency tree. */ -/** - * Default byte ceiling for the replay buffer (Issue #1047). The line-count - * cap alone never bounds a full-screen TUI stream that redraws in place and - * emits no `\n` (lineCount stays 0, so eviction never fires) — the buffer - * grows unbounded and the REPLAY frame it produces overflows the terminal - * client's backpressure budget. This byte cap bounds the buffer (and the - * replay payload) regardless of newline density. A few MB keeps a useful - * scrollback while staying small enough to replay cheaply, including across - * a remotely-hosted Tower reconnect. - * - * Deliberately a *distinct* limit from the Tower ring buffer's - * `DEFAULT_MAX_PARTIAL_BYTES` (256 KB) and the VSCode client's `MAX_QUEUE` - * (1 MB) — different layers, not a value to unify: the shellper retains - * scrollback for replay, Tower re-caps its own partial on ingest (`pushData`), - * and the client bounds unrendered output. It lives here rather than in a - * shared module because this file is dependency-free so the shellper process - * stays lean (see the header); importing from the `index.ts` barrel would pull - * in the full terminal tree (node-pty, ws). - */ -export const DEFAULT_MAX_REPLAY_BYTES = 2 * 1024 * 1024; - -function countNewlines(buf: Buffer): number { - let n = 0; - for (let i = 0; i < buf.length; i++) { - if (buf[i] === 0x0a) n++; - } - return n; -} - export class ShellperReplayBuffer { private chunks: Buffer[] = []; private totalBytes = 0; private readonly maxLines: number; - private readonly maxBytes: number; private lineCount = 0; /** * @param maxLines Maximum number of lines to retain. Lines are delimited * by newline characters in the raw data stream. - * @param maxBytes Maximum bytes to retain regardless of newline count. - * Guards against unbounded growth on no-newline TUI streams (#1047). */ - constructor(maxLines: number = 10_000, maxBytes: number = DEFAULT_MAX_REPLAY_BYTES) { + constructor(maxLines: number = 10_000) { this.maxLines = maxLines; - if (maxBytes > 0) { - this.maxBytes = maxBytes; - } else { - this.maxBytes = DEFAULT_MAX_REPLAY_BYTES; - } - } - - private exceedsLimit(): boolean { - return this.lineCount > this.maxLines || this.totalBytes > this.maxBytes; } /** * Append raw PTY output data to the buffer. - * Evicts oldest data when either the line OR the byte limit is exceeded, so - * a stream with no newlines is still bounded by bytes (#1047). + * Evicts oldest chunks if the line count exceeds maxLines. */ append(data: Buffer | string): void { const buf = typeof data === 'string' ? Buffer.from(data, 'utf-8') : data; if (buf.length === 0) return; + // Count newlines in this chunk + let newLines = 0; + for (let i = 0; i < buf.length; i++) { + if (buf[i] === 0x0a) newLines++; + } + this.chunks.push(buf); this.totalBytes += buf.length; - this.lineCount += countNewlines(buf); + this.lineCount += newLines; - // Evict whole oldest chunks while either limit is exceeded. - while (this.exceedsLimit() && this.chunks.length > 1) { - const oldest = this.chunks.shift()!; + // Evict oldest chunks if we've exceeded the line limit + while (this.lineCount > this.maxLines && this.chunks.length > 1) { + const oldest = this.chunks[0]; + let removedLines = 0; + for (let i = 0; i < oldest.length; i++) { + if (oldest[i] === 0x0a) removedLines++; + } + this.chunks.shift(); this.totalBytes -= oldest.length; - this.lineCount -= countNewlines(oldest); + this.lineCount -= removedLines; } - // Single remaining chunk still over a limit: trim from the front far enough - // to satisfy BOTH the line and the byte bound. - if (this.exceedsLimit() && this.chunks.length === 1) { + // Handle edge case: single chunk exceeds line limit. + // Trim from the front to keep only the last maxLines lines. + if (this.lineCount > this.maxLines && this.chunks.length === 1) { const chunk = this.chunks[0]; + let linesToSkip = this.lineCount - this.maxLines; let offset = 0; - if (this.lineCount > this.maxLines) { - let linesToSkip = this.lineCount - this.maxLines; - while (linesToSkip > 0 && offset < chunk.length) { - if (chunk[offset] === 0x0a) linesToSkip--; - offset++; - } - } - const byteOffset = chunk.length - this.maxBytes; - if (byteOffset > offset) { - offset = byteOffset; - } - if (offset > 0) { - this.chunks[0] = chunk.subarray(offset); - this.totalBytes = this.chunks[0].length; - this.lineCount = countNewlines(this.chunks[0]); + while (linesToSkip > 0 && offset < chunk.length) { + if (chunk[offset] === 0x0a) linesToSkip--; + offset++; } + this.chunks[0] = chunk.subarray(offset); + this.totalBytes = this.chunks[0].length; + this.lineCount = this.maxLines; } } diff --git a/packages/vscode/CHANGELOG.md b/packages/vscode/CHANGELOG.md index 0fcdc248c..ddc060edc 100644 --- a/packages/vscode/CHANGELOG.md +++ b/packages/vscode/CHANGELOG.md @@ -4,7 +4,7 @@ ### Bug fixes -- **Terminals no longer freeze over time until you restart Tower.** After hours of use, architect and builder terminals could stop rendering output all at once, with a Tower restart the only (and unreliable) fix. The cause was a full-screen TUI (such as the Claude Code prompt) that redraws in place and emits very little newline-terminated output: the buffer Tower keeps for reconnection grew without bound, and the oversized snapshot it sent on connect blew past the terminal's 1 MB receive budget, so the terminal disconnected, reconnected, received the same oversized snapshot, and looped — thousands of times a minute — pegging Tower's CPU and starving every terminal. The buffer is now size-bounded, the reconnection snapshot is delivered in a paced way that no longer counts against the live backpressure budget, a reconnect now asks Tower only for what it missed instead of re-downloading the whole buffer (important when Tower runs remotely), and genuine live overload now drops ephemeral output (the terminal repaints) instead of reconnecting. Buffer caps are tunable via `CODEV_TERMINAL_MAX_PARTIAL_BYTES` and `CODEV_SHELLPER_MAX_REPLAY_BYTES` for remote-hosted setups. +- **Terminals no longer freeze over time until you restart Tower.** After hours of use, architect and builder terminals could stop rendering output all at once, with a Tower restart the only (and unreliable) fix. The cause was a full-screen TUI (such as the Claude Code prompt) that redraws in place and emits very little newline-terminated output: Tower re-scanned its entire (ever-growing) reconnection buffer on every chunk of output, and the oversized snapshot it sent on connect blew past the terminal's 1 MB receive budget, so the terminal disconnected, reconnected, received the same oversized snapshot, and looped — thousands of times a minute — pegging Tower's CPU and starving every terminal. Tower now processes only each new chunk instead of re-scanning the whole buffer, the reconnection snapshot is delivered in a paced way that no longer counts against the live backpressure budget, a reconnect now asks Tower only for what it missed instead of re-downloading the whole buffer (important when Tower runs remotely), and genuine live overload now drops ephemeral output (the terminal repaints) instead of reconnecting. ## [3.1.9] - 2026-06-08 From 7100a5a38a3b18b6782f1797ac1949928ab77ebf Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 13:56:50 +1000 Subject: [PATCH 18/31] [PIR #1047] Re-assert terminal size after replay (Option A) + temporary [#1047-diag] logging --- codev/state/pir-1047_thread.md | 9 +++++++ packages/vscode/src/terminal-adapter.ts | 32 +++++++++++++++++++++---- 2 files changed, 36 insertions(+), 5 deletions(-) diff --git a/codev/state/pir-1047_thread.md b/codev/state/pir-1047_thread.md index 79f288564..00a8a5e98 100644 --- a/codev/state/pir-1047_thread.md +++ b/codev/state/pir-1047_thread.md @@ -75,3 +75,12 @@ Key realization: the byte caps were NOT needed for the freeze fix. CPU is fixed Action (architect-approved): stripped both byte caps, reverting partial/replay to main's faithful-unbounded behavior. Kept: scan-only pushData (CPU), Fix D bracket (no storm), client drop-not-reconnect + resume-delta + resize-deferral, Fix E listener hygiene, instrumentation (partialBytes monitor; WARN threshold raised to 4MB, reworded — it now surfaces unbounded growth for observability, not a cap). Removed env vars CODEV_TERMINAL_MAX_PARTIAL_BYTES / CODEV_SHELLPER_MAX_REPLAY_BYTES. Memory bounding (if ever needed) = screen-aware anchoring as a separate follow-up. Tests after revert: core 3308 pass / 0 fail; vscode terminal-adapter 18 pass; build green. Awaiting re-test at dev-approval. + +## Blank-on-open: Option A + diagnostics (2026-06-15) +Removing byte caps did NOT fix the blank-on-first-open → truncation wasn't the cause. The blank is the resize path: on first open the PTY stays at spawn-time size, the real setDimensions is deferred/lost during the replay window, app never gets SIGWINCH → blank until manual resize. (Confirmed empirically: manual window resize paints it.) + +Fix = Option A (suggested earlier this session): in the `resume` control handler, always re-assert size via `sendResize(pendingResize ?? lastDimensions)` instead of only-if-pendingResize. Automates the known-good manual-resize action → SIGWINCH → repaint. #625 deferral-during-replay protection intact (we only send AFTER resume). + +Also added temporary diagnostic logging (all tagged `[#1047-diag]`, greppable, to be stripped before gate): logs open(dims), setDimensions (deferred vs now), WS-open resize, sendResize source, pause/resume/seq controls, handleData replay-vs-live + byte counts. User builds+installs the extension directly, so these surface in the Terminal output channel. + +NEXT: user rebuilds/installs extension, opens a terminal, pastes [#1047-diag] log. Confirms whether Option A paints it AND shows the exact event sequence. diff --git a/packages/vscode/src/terminal-adapter.ts b/packages/vscode/src/terminal-adapter.ts index 0e274fa88..7174b1e34 100644 --- a/packages/vscode/src/terminal-adapter.ts +++ b/packages/vscode/src/terminal-adapter.ts @@ -71,6 +71,7 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { ) {} open(initialDimensions: vscode.TerminalDimensions | undefined): void { + this.log('INFO', `[#1047-diag] open(initialDimensions=${initialDimensions ? `${initialDimensions.columns}x${initialDimensions.rows}` : 'undefined'})`); if (initialDimensions) { this.lastDimensions = { cols: initialDimensions.columns, rows: initialDimensions.rows }; } @@ -121,9 +122,11 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { this.lastDimensions = { cols: dimensions.columns, rows: dimensions.rows }; if (this.replaying) { // Defer resize during replay to prevent garbled rendering (Bugfix #625) + this.log('INFO', `[#1047-diag] setDimensions(${dimensions.columns}x${dimensions.rows}) DEFERRED (replaying)`); this.pendingResize = { cols: dimensions.columns, rows: dimensions.rows }; return; } + this.log('INFO', `[#1047-diag] setDimensions(${dimensions.columns}x${dimensions.rows}) sending now`); this.sendResize(dimensions.columns, dimensions.rows); } @@ -173,7 +176,10 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { // overlap streaming content. Pause/replay messages arrive *after* // this outbound resize, so the order is auth → resize → replay → resume. if (this.lastDimensions) { + this.log('INFO', `[#1047-diag] WS open: sending resize ${this.lastDimensions.cols}x${this.lastDimensions.rows}`); this.sendResize(this.lastDimensions.cols, this.lastDimensions.rows); + } else { + this.log('INFO', '[#1047-diag] WS open: no lastDimensions, no resize sent'); } }); @@ -335,9 +341,11 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { if (this.replaying) { const text = this.decoder.decode(payload, { stream: true }); const safe = this.escapeBuffer.write(text); + this.log('INFO', `[#1047-diag] handleData REPLAY: payload=${payload.length}B, rendered=${safe.length}B`); if (safe.length > 0) { this.writeChunked(safe); } return; } + this.log('INFO', `[#1047-diag] handleData LIVE: payload=${payload.length}B`); // Live overload: if rendered output falls far enough behind that the queue // exceeds MAX_QUEUE, DROP this burst rather than reconnecting. Terminal @@ -397,20 +405,32 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { switch (msg.type) { case 'seq': this.lastSeq = (msg.payload.seq as number) ?? this.lastSeq; + this.log('INFO', `[#1047-diag] control: seq=${this.lastSeq}`); break; case 'pong': break; case 'pause': + this.log('INFO', '[#1047-diag] control: pause (replay start)'); this.replaying = true; break; - case 'resume': + case 'resume': { + this.log('INFO', `[#1047-diag] control: resume (replay end), pendingResize=${this.pendingResize ? `${this.pendingResize.cols}x${this.pendingResize.rows}` : 'none'}, lastDimensions=${this.lastDimensions ? `${this.lastDimensions.cols}x${this.lastDimensions.rows}` : 'none'}`); this.replaying = false; - // Flush deferred resize - if (this.pendingResize) { - this.sendResize(this.pendingResize.cols, this.pendingResize.rows); - this.pendingResize = null; + // Re-assert the terminal size once replay finishes (#1047). On a first + // open the PTY is still at its spawn-time default and the real + // setDimensions was deferred during the replay window, so the TUI never + // gets a SIGWINCH and stays blank until the user manually resizes the + // window. Sending the known size here forces that SIGWINCH so the app + // repaints — automating the manual-resize that empirically fixes it. + // Prefer any resize deferred during replay, else the last known + // dimensions. A redundant same-size resize is harmless. + const dims = this.pendingResize ?? this.lastDimensions; + this.pendingResize = null; + if (dims) { + this.sendResize(dims.cols, dims.rows); } break; + } case 'error': this.log('ERROR', `Server error: ${JSON.stringify(msg.payload)}`); break; @@ -430,6 +450,8 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { } private sendResize(cols: number, rows: number): void { + const willSend = this.ws !== null && this.ws.readyState === WebSocket.OPEN; + this.log('INFO', `[#1047-diag] sendResize(${cols}x${rows}) wsOpen=${willSend}`); this.sendControl({ type: 'resize', payload: { cols, rows } }); } From 81c4a83b9db564982929a975aeee3662bac8179b Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 14:43:38 +1000 Subject: [PATCH 19/31] [PIR #1047] Force a post-connect redraw nudge (mirror web client's SIGWINCH-on-connect); revert Option A --- codev/state/pir-1047_thread.md | 11 ++++ packages/vscode/src/terminal-adapter.ts | 79 ++++++++++++++++++++----- 2 files changed, 74 insertions(+), 16 deletions(-) diff --git a/codev/state/pir-1047_thread.md b/codev/state/pir-1047_thread.md index 00a8a5e98..64cbd185d 100644 --- a/codev/state/pir-1047_thread.md +++ b/codev/state/pir-1047_thread.md @@ -84,3 +84,14 @@ Fix = Option A (suggested earlier this session): in the `resume` control handler Also added temporary diagnostic logging (all tagged `[#1047-diag]`, greppable, to be stripped before gate): logs open(dims), setDimensions (deferred vs now), WS-open resize, sendResize source, pause/resume/seq controls, handleData replay-vs-live + byte counts. User builds+installs the extension directly, so these surface in the Terminal output channel. NEXT: user rebuilds/installs extension, opens a terminal, pastes [#1047-diag] log. Confirms whether Option A paints it AND shows the exact event sequence. + +## Blank-on-open ROOT CAUSE found via web-vs-vscode comparison (2026-06-15) +User asked the decisive question: why does the WEB dashboard render terminals fine but VS Code blanks? Both hit the same Tower/WS/PTY/Claude. → the bug is in the VS Code CLIENT's connect path, not Tower/PTY/SIGWINCH; the app DOES paint (web proves it). + +Read packages/dashboard/src/components/Terminal.tsx. The web client's flushInitialBuffer (lines 463-506) UNCONDITIONALLY sends a resize ~500ms after connect, in every branch, with the comment "send SIGWINCH to make the shell redraw at the correct size". It even has a skipReplay mode: "discard replay, just send SIGWINCH to make the running program redraw from scratch". So the web's rendering robustness = forcing a post-connect redraw-SIGWINCH. The VS Code adapter had NO equivalent — only the on-open resize (which is a no-op if it matches the PTY size) → app never redraws → blank until manual resize. + +FIX (mirrors the web): after WS open, schedule a settle-delay (500ms) repaint nudge — if nothing rendered yet, send resize(cols,rows-1) then resize(cols,rows) to guarantee a real SIGWINCH at the correct size. Gated on renderedSinceConnect so reconnects that painted via replay don't reflow. Cleared on close/reconnect. Nudge (vs plain resend) chosen because VS Code can't rely on the web's fit-difference; the 1-row delta guarantees the SIGWINCH even at same size. + +This also resolves the earlier "how do you know it painted?" concern: the web client DOESN'T detect paint — it sends an unconditional delayed resize and trusts it. So we do the same (single gated nudge), not retry-until-painted. + +Reverted Option A (resume-handler change) — log proved it never runs (empty buffer, no resume). Diagnostics still in (tagged [#1047-diag]) for this test. NEXT: user rebuilds/installs, opens terminal, confirms it paints + pastes [#1047-diag] log (should show "repaint nudge: ...→..." then handleData LIVE). diff --git a/packages/vscode/src/terminal-adapter.ts b/packages/vscode/src/terminal-adapter.ts index 7174b1e34..7188734cd 100644 --- a/packages/vscode/src/terminal-adapter.ts +++ b/packages/vscode/src/terminal-adapter.ts @@ -7,6 +7,7 @@ import { BackoffController, classifyUpgradeError } from '@cluesmith/codev-core/r const CHUNK_SIZE = 16384; // 16KB — chunk onDidWrite to avoid CPU spikes const MAX_QUEUE = 1048576; // 1MB — drop live output if the unrendered queue exceeds this const DROP_WARN_INTERVAL_MS = 5000; // throttle backpressure-drop warnings (#1047) +const REPAINT_NUDGE_DELAY_MS = 500; // settle delay before forcing a redraw-SIGWINCH (#1047), matching the web client's 500ms // Reconnect backoff. The exponential curve (1000 * 2^attempt, capped at 30s), // the give-up threshold, and the session-unknown classifier all live in the @@ -50,6 +51,17 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { private lastDropWarnAt = 0; private disposed = false; + // Repaint-nudge state (#1047). A freshly-attached terminal can stay blank: + // the app inside (e.g. Claude's full-screen TUI) only paints after a real + // window-size change (SIGWINCH), but the resize we send on connect can be a + // same-size no-op or land before the app reacts. The web dashboard client + // avoids this by unconditionally sending a "redraw" resize ~500ms after + // connect (Terminal.tsx flushInitialBuffer); the Pseudoterminal path here had + // no equivalent. We mirror it: after a settle delay, if nothing has rendered, + // nudge the size (a brief 1-row change, then back) to force the SIGWINCH. + private renderedSinceConnect = false; + private repaintNudgeTimer: ReturnType | null = null; + // Reconnect-loop state. The adapter owns reconnection end-to-end (#936) — // backoff scheduling, give-up after MAX_RECONNECT_ATTEMPTS, and a terminal // failure state that stops the loop until the user manually reconnects. The @@ -85,6 +97,7 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { close(): void { this.disposed = true; + this.clearRepaintNudge(); if (this.reconnectTimer) { clearTimeout(this.reconnectTimer); this.reconnectTimer = null; @@ -181,6 +194,10 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { } else { this.log('INFO', '[#1047-diag] WS open: no lastDimensions, no resize sent'); } + // Force a redraw-SIGWINCH after the connection settles if nothing has + // rendered (#1047), mirroring the web dashboard's post-connect resize. + this.renderedSinceConnect = false; + this.scheduleRepaintNudge(); }); this.ws.on('message', (raw: ArrayBuffer) => { @@ -300,6 +317,9 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { // mis-route the next connection's live data (#1047). this.replaying = false; this.queuedBytes = 0; + // Drop any pending repaint nudge from a prior connection; the next WS open + // reschedules it. + this.clearRepaintNudge(); } /** @@ -342,10 +362,11 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { const text = this.decoder.decode(payload, { stream: true }); const safe = this.escapeBuffer.write(text); this.log('INFO', `[#1047-diag] handleData REPLAY: payload=${payload.length}B, rendered=${safe.length}B`); - if (safe.length > 0) { this.writeChunked(safe); } + if (safe.length > 0) { this.renderedSinceConnect = true; this.writeChunked(safe); } return; } this.log('INFO', `[#1047-diag] handleData LIVE: payload=${payload.length}B`); + this.renderedSinceConnect = true; // Live overload: if rendered output falls far enough behind that the queue // exceeds MAX_QUEUE, DROP this burst rather than reconnecting. Terminal @@ -386,6 +407,41 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { } } + /** + * Schedule the post-connect repaint nudge (#1047). Fires once after a settle + * delay; if the pane is still blank (nothing rendered since connect), it + * forces a SIGWINCH so the app redraws — the equivalent of the web client's + * unconditional post-connect resize. Skipped when replay/live output has + * already painted, so a reconnect that rendered from the buffer doesn't + * reflow. + */ + private scheduleRepaintNudge(): void { + this.clearRepaintNudge(); + this.repaintNudgeTimer = setTimeout(() => { + this.repaintNudgeTimer = null; + if (this.disposed || this.renderedSinceConnect) { + this.log('INFO', `[#1047-diag] repaint nudge skipped (rendered=${this.renderedSinceConnect})`); + return; + } + if (!this.lastDimensions) { return; } + const { cols, rows } = this.lastDimensions; + if (rows <= 1) { this.sendResize(cols, rows); return; } + // A 1-row change then back guarantees a real TIOCSWINSZ delta (and thus a + // SIGWINCH) even if the PTY is already at the target size, ending at the + // correct dimensions. The brief intermediate frame is overwritten at once. + this.log('INFO', `[#1047-diag] repaint nudge: ${cols}x${rows - 1} -> ${cols}x${rows}`); + this.sendResize(cols, rows - 1); + this.sendResize(cols, rows); + }, REPAINT_NUDGE_DELAY_MS); + } + + private clearRepaintNudge(): void { + if (this.repaintNudgeTimer) { + clearTimeout(this.repaintNudgeTimer); + this.repaintNudgeTimer = null; + } + } + private writeChunked(text: string): void { let offset = 0; const writeNext = (): void => { @@ -413,24 +469,15 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { this.log('INFO', '[#1047-diag] control: pause (replay start)'); this.replaying = true; break; - case 'resume': { - this.log('INFO', `[#1047-diag] control: resume (replay end), pendingResize=${this.pendingResize ? `${this.pendingResize.cols}x${this.pendingResize.rows}` : 'none'}, lastDimensions=${this.lastDimensions ? `${this.lastDimensions.cols}x${this.lastDimensions.rows}` : 'none'}`); + case 'resume': + this.log('INFO', '[#1047-diag] control: resume (replay end)'); this.replaying = false; - // Re-assert the terminal size once replay finishes (#1047). On a first - // open the PTY is still at its spawn-time default and the real - // setDimensions was deferred during the replay window, so the TUI never - // gets a SIGWINCH and stays blank until the user manually resizes the - // window. Sending the known size here forces that SIGWINCH so the app - // repaints — automating the manual-resize that empirically fixes it. - // Prefer any resize deferred during replay, else the last known - // dimensions. A redundant same-size resize is harmless. - const dims = this.pendingResize ?? this.lastDimensions; - this.pendingResize = null; - if (dims) { - this.sendResize(dims.cols, dims.rows); + // Flush deferred resize + if (this.pendingResize) { + this.sendResize(this.pendingResize.cols, this.pendingResize.rows); + this.pendingResize = null; } break; - } case 'error': this.log('ERROR', `Server error: ${JSON.stringify(msg.payload)}`); break; From 6d994f72e90c60cc0569f91a8a09e122b790e17d Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 14:50:07 +1000 Subject: [PATCH 20/31] [PIR #1047] docs: changelog entry for terminal blank-on-open fix --- packages/vscode/CHANGELOG.md | 1 + 1 file changed, 1 insertion(+) diff --git a/packages/vscode/CHANGELOG.md b/packages/vscode/CHANGELOG.md index ddc060edc..301a9879d 100644 --- a/packages/vscode/CHANGELOG.md +++ b/packages/vscode/CHANGELOG.md @@ -5,6 +5,7 @@ ### Bug fixes - **Terminals no longer freeze over time until you restart Tower.** After hours of use, architect and builder terminals could stop rendering output all at once, with a Tower restart the only (and unreliable) fix. The cause was a full-screen TUI (such as the Claude Code prompt) that redraws in place and emits very little newline-terminated output: Tower re-scanned its entire (ever-growing) reconnection buffer on every chunk of output, and the oversized snapshot it sent on connect blew past the terminal's 1 MB receive budget, so the terminal disconnected, reconnected, received the same oversized snapshot, and looped — thousands of times a minute — pegging Tower's CPU and starving every terminal. Tower now processes only each new chunk instead of re-scanning the whole buffer, the reconnection snapshot is delivered in a paced way that no longer counts against the live backpressure budget, a reconnect now asks Tower only for what it missed instead of re-downloading the whole buffer (important when Tower runs remotely), and genuine live overload now drops ephemeral output (the terminal repaints) instead of reconnecting. +- **Terminals no longer open to a blank pane until you resize the window.** Opening an architect or builder terminal could show an empty pane with just a cursor, staying blank until you nudged the window size by hand. A full-screen terminal program only repaints when the terminal size changes, and the size the extension sent on connect could match what the program already had — so it never redrew. The extension now forces a redraw shortly after connecting if the pane is still empty, the same way the web dashboard already does. ## [3.1.9] - 2026-06-08 From cc7fbdc7361e6c111845d4f5df2a4395e7d78e8b Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 14:57:58 +1000 Subject: [PATCH 21/31] [PIR #1047] Strip temporary [#1047-diag] diagnostic logging --- codev/state/pir-1047_thread.md | 3 +++ packages/vscode/src/terminal-adapter.ts | 19 +------------------ 2 files changed, 4 insertions(+), 18 deletions(-) diff --git a/codev/state/pir-1047_thread.md b/codev/state/pir-1047_thread.md index 64cbd185d..de02f4fc5 100644 --- a/codev/state/pir-1047_thread.md +++ b/codev/state/pir-1047_thread.md @@ -95,3 +95,6 @@ FIX (mirrors the web): after WS open, schedule a settle-delay (500ms) repaint nu This also resolves the earlier "how do you know it painted?" concern: the web client DOESN'T detect paint — it sends an unconditional delayed resize and trusts it. So we do the same (single gated nudge), not retry-until-painted. Reverted Option A (resume-handler change) — log proved it never runs (empty buffer, no resume). Diagnostics still in (tagged [#1047-diag]) for this test. NEXT: user rebuilds/installs, opens terminal, confirms it paints + pastes [#1047-diag] log (should show "repaint nudge: ...→..." then handleData LIVE). + +## Blank-on-open CONFIRMED FIXED (2026-06-15) +User confirmed terminals now paint on open with the repaint-nudge fix. Stripped all [#1047-diag] diagnostic logging from terminal-adapter.ts (13 lines). Typecheck clean, adapter tests 18/18. The nudge + WS-open resize (#737) remain separate as decided. diff --git a/packages/vscode/src/terminal-adapter.ts b/packages/vscode/src/terminal-adapter.ts index 7188734cd..ead0b187d 100644 --- a/packages/vscode/src/terminal-adapter.ts +++ b/packages/vscode/src/terminal-adapter.ts @@ -83,7 +83,6 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { ) {} open(initialDimensions: vscode.TerminalDimensions | undefined): void { - this.log('INFO', `[#1047-diag] open(initialDimensions=${initialDimensions ? `${initialDimensions.columns}x${initialDimensions.rows}` : 'undefined'})`); if (initialDimensions) { this.lastDimensions = { cols: initialDimensions.columns, rows: initialDimensions.rows }; } @@ -135,11 +134,9 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { this.lastDimensions = { cols: dimensions.columns, rows: dimensions.rows }; if (this.replaying) { // Defer resize during replay to prevent garbled rendering (Bugfix #625) - this.log('INFO', `[#1047-diag] setDimensions(${dimensions.columns}x${dimensions.rows}) DEFERRED (replaying)`); this.pendingResize = { cols: dimensions.columns, rows: dimensions.rows }; return; } - this.log('INFO', `[#1047-diag] setDimensions(${dimensions.columns}x${dimensions.rows}) sending now`); this.sendResize(dimensions.columns, dimensions.rows); } @@ -189,10 +186,7 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { // overlap streaming content. Pause/replay messages arrive *after* // this outbound resize, so the order is auth → resize → replay → resume. if (this.lastDimensions) { - this.log('INFO', `[#1047-diag] WS open: sending resize ${this.lastDimensions.cols}x${this.lastDimensions.rows}`); this.sendResize(this.lastDimensions.cols, this.lastDimensions.rows); - } else { - this.log('INFO', '[#1047-diag] WS open: no lastDimensions, no resize sent'); } // Force a redraw-SIGWINCH after the connection settles if nothing has // rendered (#1047), mirroring the web dashboard's post-connect resize. @@ -361,11 +355,9 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { if (this.replaying) { const text = this.decoder.decode(payload, { stream: true }); const safe = this.escapeBuffer.write(text); - this.log('INFO', `[#1047-diag] handleData REPLAY: payload=${payload.length}B, rendered=${safe.length}B`); if (safe.length > 0) { this.renderedSinceConnect = true; this.writeChunked(safe); } return; } - this.log('INFO', `[#1047-diag] handleData LIVE: payload=${payload.length}B`); this.renderedSinceConnect = true; // Live overload: if rendered output falls far enough behind that the queue @@ -419,17 +411,13 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { this.clearRepaintNudge(); this.repaintNudgeTimer = setTimeout(() => { this.repaintNudgeTimer = null; - if (this.disposed || this.renderedSinceConnect) { - this.log('INFO', `[#1047-diag] repaint nudge skipped (rendered=${this.renderedSinceConnect})`); - return; - } + if (this.disposed || this.renderedSinceConnect) { return; } if (!this.lastDimensions) { return; } const { cols, rows } = this.lastDimensions; if (rows <= 1) { this.sendResize(cols, rows); return; } // A 1-row change then back guarantees a real TIOCSWINSZ delta (and thus a // SIGWINCH) even if the PTY is already at the target size, ending at the // correct dimensions. The brief intermediate frame is overwritten at once. - this.log('INFO', `[#1047-diag] repaint nudge: ${cols}x${rows - 1} -> ${cols}x${rows}`); this.sendResize(cols, rows - 1); this.sendResize(cols, rows); }, REPAINT_NUDGE_DELAY_MS); @@ -461,16 +449,13 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { switch (msg.type) { case 'seq': this.lastSeq = (msg.payload.seq as number) ?? this.lastSeq; - this.log('INFO', `[#1047-diag] control: seq=${this.lastSeq}`); break; case 'pong': break; case 'pause': - this.log('INFO', '[#1047-diag] control: pause (replay start)'); this.replaying = true; break; case 'resume': - this.log('INFO', '[#1047-diag] control: resume (replay end)'); this.replaying = false; // Flush deferred resize if (this.pendingResize) { @@ -497,8 +482,6 @@ export class CodevPseudoterminal implements vscode.Pseudoterminal { } private sendResize(cols: number, rows: number): void { - const willSend = this.ws !== null && this.ws.readyState === WebSocket.OPEN; - this.log('INFO', `[#1047-diag] sendResize(${cols}x${rows}) wsOpen=${willSend}`); this.sendControl({ type: 'resize', payload: { cols, rows } }); } From d0226ea82628ae97640d7cf445489c74c5719f91 Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 16:08:32 +1000 Subject: [PATCH 22/31] chore(porch): 1047 dev-approval gate-approved --- .../projects/1047-tower-terminals-architects-bui/status.yaml | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/codev/projects/1047-tower-terminals-architects-bui/status.yaml b/codev/projects/1047-tower-terminals-architects-bui/status.yaml index ed92eb8d3..b2c4eeee8 100644 --- a/codev/projects/1047-tower-terminals-architects-bui/status.yaml +++ b/codev/projects/1047-tower-terminals-architects-bui/status.yaml @@ -10,12 +10,13 @@ gates: requested_at: '2026-06-14T22:20:45.222Z' approved_at: '2026-06-14T23:10:56.668Z' dev-approval: - status: pending + status: approved requested_at: '2026-06-14T23:33:49.007Z' + approved_at: '2026-06-15T06:08:32.090Z' pr: status: pending iteration: 1 build_complete: false history: [] started_at: '2026-06-14T22:12:29.323Z' -updated_at: '2026-06-14T23:33:49.008Z' +updated_at: '2026-06-15T06:08:32.090Z' From fd86fa927a9973c8717b3705870eb46d3c729d72 Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 16:08:47 +1000 Subject: [PATCH 23/31] chore(porch): 1047 review phase-transition --- .../projects/1047-tower-terminals-architects-bui/status.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/codev/projects/1047-tower-terminals-architects-bui/status.yaml b/codev/projects/1047-tower-terminals-architects-bui/status.yaml index b2c4eeee8..7d0f42620 100644 --- a/codev/projects/1047-tower-terminals-architects-bui/status.yaml +++ b/codev/projects/1047-tower-terminals-architects-bui/status.yaml @@ -1,7 +1,7 @@ id: '1047' title: tower-terminals-architects-bui protocol: pir -phase: implement +phase: review plan_phases: [] current_plan_phase: null gates: @@ -19,4 +19,4 @@ iteration: 1 build_complete: false history: [] started_at: '2026-06-14T22:12:29.323Z' -updated_at: '2026-06-15T06:08:32.090Z' +updated_at: '2026-06-15T06:08:47.532Z' From bdbef0f0075f73acb6ca098c9965bedbec0d3e57 Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 16:14:45 +1000 Subject: [PATCH 24/31] [PIR #1047] Add repaint-nudge unit tests (fires when blank, skips when rendered) --- .../src/__tests__/terminal-adapter.test.ts | 51 ++++++++++++++++++- 1 file changed, 49 insertions(+), 2 deletions(-) diff --git a/packages/vscode/src/__tests__/terminal-adapter.test.ts b/packages/vscode/src/__tests__/terminal-adapter.test.ts index bc632f17b..2885014ee 100644 --- a/packages/vscode/src/__tests__/terminal-adapter.test.ts +++ b/packages/vscode/src/__tests__/terminal-adapter.test.ts @@ -39,13 +39,14 @@ vi.mock('ws', () => { readyState = 0; binaryType = ''; closed = false; + sent: unknown[] = []; private handlers: Record void>> = {}; constructor(public url: string) { FakeWebSocket.instances.push(this); } on(event: string, cb: (...args: unknown[]) => void): this { (this.handlers[event] ||= []).push(cb); return this; } - send(): void {} + send(data: unknown): void { this.sent.push(data); } close(): void { this.closed = true; } /** Test helper: synchronously dispatch a lifecycle event to handlers. */ emit(event: string, ...args: unknown[]): void { @@ -82,9 +83,18 @@ function sendControl( socket.emit('message', Buffer.concat([Buffer.from([FRAME_CONTROL]), json])); } +/** Extract the resize control frames a fake socket has sent (#1047). */ +function sentResizes(socket: { sent: unknown[] }): Array<{ cols: number; rows: number }> { + return (socket.sent as Buffer[]) + .filter((b) => b[0] === FRAME_CONTROL) + .map((b) => JSON.parse(b.subarray(1).toString('utf-8')) as { type: string; payload: { cols: number; rows: number } }) + .filter((m) => m.type === 'resize') + .map((m) => m.payload); +} + // Imports AFTER mocks are registered. const WebSocket = (await import('ws')).default as unknown as { - instances: Array<{ closed: boolean; emit(e: string, ...a: unknown[]): void }>; + instances: Array<{ closed: boolean; readyState: number; sent: unknown[]; emit(e: string, ...a: unknown[]): void }>; OPEN: number; }; const { CodevPseudoterminal, RECONNECT_LINK_TEXT } = await import('../terminal-adapter.js'); @@ -380,3 +390,40 @@ describe('PIR #1047 — oversized replay storm prevention', () => { expect(url).not.toContain('resume'); }); }); + +describe('PIR #1047 — post-connect repaint nudge', () => { + /** Open the adapter with known dimensions and an OPEN socket so sends record. */ + function makeOpenAdapter(cols: number, rows: number) { + const { pty, writes } = makeAdapter(); + (pty as unknown as { setDimensions(d: { columns: number; rows: number }): void }) + .setDimensions({ columns: cols, rows: rows }); + const socket = currentSocket(); + socket.readyState = WebSocket.OPEN; + socket.emit('open'); + return { pty, writes, socket }; + } + + it('forces a redraw nudge (rows-1 then rows) when the pane stays blank after connect', () => { + const { socket } = makeOpenAdapter(100, 40); + socket.sent.length = 0; // drop the on-open resize; isolate the nudge + + vi.advanceTimersByTime(500); + + // The nudge is a real size delta then back: 100x39 then 100x40. + expect(sentResizes(socket)).toEqual([ + { cols: 100, rows: 39 }, + { cols: 100, rows: 40 }, + ]); + }); + + it('skips the nudge when output already rendered during the settle window', () => { + const { socket } = makeOpenAdapter(100, 40); + socket.sent.length = 0; + + // Live output arrives before the settle delay → pane is no longer blank. + sendData(socket, Buffer.from('hello', 'utf-8')); + vi.advanceTimersByTime(500); + + expect(sentResizes(socket).some((r) => r.rows === 39)).toBe(false); + }); +}); From baef221788f0dce8b2ba9b694cc156bb4149d4f0 Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 16:14:45 +1000 Subject: [PATCH 25/31] [PIR #1047] Review + retrospective --- codev/resources/arch.md | 4 + codev/resources/lessons-learned.md | 3 + .../1047-tower-terminals-architects-bui.md | 78 +++++++++++++++++++ 3 files changed, 85 insertions(+) create mode 100644 codev/reviews/1047-tower-terminals-architects-bui.md diff --git a/codev/resources/arch.md b/codev/resources/arch.md index b6e545104..07ff828f0 100644 --- a/codev/resources/arch.md +++ b/codev/resources/arch.md @@ -206,6 +206,10 @@ Shell / Claude / Builder process 5. **Kill**: Tower sends SIGTERM via SIGNAL frame, waits 5s, SIGKILL if needed. Cleans up socket file. 6. **Graceful degradation**: If shellper spawn fails, Tower falls back to direct node-pty (non-persistent). SQLite row has `shellper_socket = NULL`. Dashboard shows "Session persistence unavailable" warning. +#### Terminal reconnect/replay contract (#1047) + +When a WebSocket client attaches, Tower replays the ring buffer as a binary DATA frame **bracketed by `pause`/`resume` control frames** so the client can render the (potentially large) snapshot without counting it against its live-backpressure budget. Clients pass `?resume=` to request only the bytes after a sequence number (a *delta* reconnect) instead of the full buffer — the dominant cost saver when Tower is hosted remotely and reconnects are frequent. The ring-buffer `partial` (incomplete trailing line) is kept **whole and unbounded** so a full-screen TUI's alt-screen state replays faithfully; per-frame CPU is bounded instead by `pushData` scanning only the new chunk (not re-splitting the accumulated buffer). On the client side, a connecting terminal must **force a redraw** shortly after attach (a size-delta resize → SIGWINCH) — a full-screen TUI only repaints on a size *change*, and the connect-time resize can be a same-size no-op; both the web dashboard (`Terminal.tsx`) and the VSCode adapter (`terminal-adapter.ts`) do this. Under genuine *live* overload, clients **drop** ephemeral output (the app repaints) rather than reconnecting — reconnecting to relieve backpressure re-pulls the same payload and storms. + #### Startup Readiness Barrier (#997) Tower binds its port and starts serving immediately, but `reconcileTerminalSessions()` (which re-registers persistent sessions in the `workspaceTerminals` map) runs *after* `server.listen()`. To stop the first post-restart read from seeing a half-populated `role → terminalId` map, a monotonic settled-once barrier in `tower-terminals.ts` gates the readers of reconcile's output: diff --git a/codev/resources/lessons-learned.md b/codev/resources/lessons-learned.md index 82075f2f1..03371829d 100644 --- a/codev/resources/lessons-learned.md +++ b/codev/resources/lessons-learned.md @@ -424,6 +424,9 @@ Generalizable wisdom extracted from review documents, ordered by impact. Updated - [From 913] VSCode's `TreeView` exposes `reveal(…, {expand})` but **no per-item collapse** — and it replays its own remembered expand/collapse state for any `TreeItem.id` it has already seen, ignoring the provider's `collapsibleState` on a re-render. The only blunt collapse primitive, `workbench.actions.treeView..collapseAll`, is *tree-wide* (it flattens group headers too, not just the rows you meant). To force *specific* rows collapsed while leaving others alone, **version the id**: render each row as `#` and bump the version for the rows you want collapsed — a never-seen id makes VSCode fall back to the provider's default state. Keep the row you want left open at its *verbatim* current id (don't reconstruct it from a counter — store the literal string, so a render landing between a bump and the next interaction can't strand it on a stale version). The monotonic version is load-bearing: a re-opened row must never reuse an id VSCode remembers as expanded, or it pops back open. Encapsulate the three moving parts (version, open-id, open-builder) in one small named unit rather than scattering counters across the provider. - [From 913] Don't persist UI state whose lifetime is shorter than the thing it describes. The Builders tree's per-group expansion was saved to `workspaceState`, but active builders live hours and the set churns constantly — persisting "you collapsed VSCODE last Tuesday" across reloads served no one and actively masked an accordion bug (programmatic `collapseAll` wrote `false` for every group, indistinguishable from a user chevron-click). Dropping persistence entirely and defaulting to expanded each session is *less* code and the better UX; VSCode's native per-id in-session memory already keeps a user-collapsed group collapsed until reload, for free. Persistence still earns its keep in the sibling Backlog view (hundreds of long-lived items, users curate which areas matter) — the lesson is to match persistence to lifetime per-view, not to apply one mechanism uniformly. +- [From 1047] When the same symptom appears in one client but not another that shares the backend, the bug is in the client-specific layer — diff the two clients, don't debug the backend. Terminals froze/blanked in the VSCode extension but rendered fine in the web dashboard, both hitting the same Tower/WS/PTY/shellper/Claude. That single fact proved the app *was* painting and the bytes *did* exist, so the cause was the extension's connect/render path — not a missing SIGWINCH on the app side, which is where two failed hypotheses had been chasing it. Reading the web client (`Terminal.tsx`) revealed the missing piece: it unconditionally sends a "redraw" resize ~500ms after connect ("send SIGWINCH to make the shell redraw"), and even has a `skipReplay` mode that *just* sends SIGWINCH. The VSCode `Pseudoterminal` adapter had no equivalent, so a connect-time resize that matched the PTY's size was a no-op and the full-screen TUI never repainted. The fix mirrors the web: a gated post-connect nudge (`rows-1` then `rows`) guarantees a real `TIOCSWINSZ` delta. Corollary: a full-screen TUI in the alternate screen buffer only repaints on a size *change* — a same-size resize sends no SIGWINCH, so "re-assert the size" must be a real delta, not a re-send. +- [From 1047] A buffer bounded by newline count is unbounded for a newline-free stream. `RingBuffer.partial` and `ShellperReplayBuffer` both capped by *lines*; a full-screen TUI redraws in place with cursor-addressing and `\r` and emits almost no `\n`, so a single 15 MB session had zero newlines — the partial grew without limit and every `pushData` re-split the whole accumulated string (O(n²)). The decisive evidence was the on-disk PTY logs (`~/.agent-farm/logs/*.log`): measuring "longest run without `\n`" instantly distinguished the pathological session (whole file) from normal ones (~20 bytes/line). Fix the hot path by scanning only the *new* chunk; bound by bytes only where a real ceiling is needed. (Caution: a byte cap that front-trims the buffer corrupts TUI *replay* — the alt-screen state lives in the cumulative stream from the alt-screen-enter onward — so we kept the buffer whole and bounded CPU via scan-only instead.) +- [From 1047] A backpressure "relief valve" that re-fetches the overflowing payload is an infinite loop. The VSCode terminal's old rule was "receive buffer > 1 MB → disconnect and reconnect for a fresh replay" — but the oversized thing *was* the replay, so each reconnect re-delivered it: ~14,000 connect→overflow→reconnect cycles in under an hour (visible 1:1:1 in the client log), pegging one core and starving every terminal. For ephemeral live output, *drop* under overload (the app repaints); reserve reconnect for genuine transport loss, and never make the overload remedy re-pull the same bytes. - [From 787] When adding a field to a multi-forge concept contract (`pr-list`, `issue-list`, …), per-CLI data availability diverges — do not assume parity across `gh`/`glab`/`tea`. Verify each empirically (live output or, failing that, the CLI's own `--help` and official docs): `gh pr list --json` exposes `isDraft` + `reviewRequests` (reviewer objects → flatten `.login`, dropping teams); `glab mr list --output json` exposes `draft` + `reviewers[].username` (same command, just add jq); but `tea pulls list` exposes *neither* (its JSON is limited to the selectable `--fields`, which omit draft/reviewers — the data exists only via raw `tea api`). Populate where the existing command can; default safely (`[]`/`false`) where it can't, and document the verified reason in the script. Defaulting a field the CLI *does* expose silently drops working data; rewriting a script onto a different command (e.g. `tea api`) to reach an unavailable field is a separate, larger change. Make the new required fields safe end-to-end by defaulting at the server mapping (`?? []` / `?? false`) so a non-conforming forge degrades rather than emitting `undefined`. --- diff --git a/codev/reviews/1047-tower-terminals-architects-bui.md b/codev/reviews/1047-tower-terminals-architects-bui.md new file mode 100644 index 000000000..f35964440 --- /dev/null +++ b/codev/reviews/1047-tower-terminals-architects-bui.md @@ -0,0 +1,78 @@ +# PIR Review: Terminal freeze + blank-on-open (oversized-replay storm and missing connect-time redraw) + +Fixes #1047 + +## Summary + +Two client/server terminal bugs were fixed. (1) **Freeze:** a full-screen TUI (Claude's UI) redraws in place and emits almost no newlines, so Tower's reconnection buffers — bounded by *line count* — grew unbounded; the resulting multi-MB replay overflowed the VSCode client's 1 MB receive budget, and the client's "disconnect-and-reconnect for replay" response re-fetched the same oversized snapshot in a tight loop (~14k cycles/hour), pegging one CPU core and starving every terminal. (2) **Blank-on-open:** a freshly-attached terminal could render a blank pane until a manual window resize, because the connect-time resize could be a same-size no-op and a full-screen TUI only repaints on a size *change*. Tower now scans only new output (killing the O(n²) re-scan) and brackets replay with `pause`/`resume` so it is paced and excluded from backpressure; the client drops live output under overload instead of reconnecting, reconnects with a `?resume=` delta, and forces a guaranteed redraw shortly after connect (mirroring the web dashboard). + +## Files Changed + +- `packages/codev/src/terminal/ring-buffer.ts` (+39 / -…) — `pushData` scans only the incoming chunk (O(|data|), not O(|partial|)); `partial` kept whole for faithful replay; `partialBytes` getter for observability. +- `packages/codev/src/terminal/pty-session.ts` (+14) — `attachShellper` idempotency (drop prior client listeners on re-attach); `partialBytes` passthrough. +- `packages/codev/src/terminal/pty-manager.ts` (+30 / -…) — `createSessionRaw` tears down a colliding session before overwrite; `inspectPartials()` for the monitor. +- `packages/codev/src/agent-farm/servers/tower-websocket.ts` (+8 / -…) — bracket replay with `pause`/`resume` control frames. +- `packages/codev/src/agent-farm/servers/tower-server.ts` (+32) — periodic ring-buffer-partial monitor (observability). +- `packages/vscode/src/terminal-adapter.ts` (+124 / -…) — replay excluded from backpressure; live overload **drops** instead of reconnecting; `?resume=` delta reconnect; post-connect repaint nudge. +- `packages/codev/src/terminal/__tests__/ring-buffer.test.ts` (+25), `packages/codev/src/agent-farm/__tests__/tower-websocket.test.ts` (+14 / -…), `packages/vscode/src/__tests__/terminal-adapter.test.ts` (+137 / -…) — unit coverage. +- `packages/vscode/CHANGELOG.md` (+7) — user-facing entries for both fixes. +- `codev/resources/arch.md` (+4), `codev/resources/lessons-learned.md` (+3) — see sections below. + +(`git diff --stat`: 12 files, +410 / -27, excluding plan/thread/status artifacts.) + +## Commits + +- `8868ac86` [PIR #1047] Fix A+E: byte-cap RingBuffer.partial (scan-only pushData) + reconnect listener hygiene +- `7d42dac9` [PIR #1047] Fix B: byte-cap ShellperReplayBuffer (bounds no-newline replay across restart) +- `f7e6b3a2` [PIR #1047] Fix C+D: pause/resume replay bracket, drop-not-reconnect on live overload, resume-delta reconnect +- `9f282745` [PIR #1047] Instrumentation: log terminal ring-buffer partial sizes on a Tower monitor interval +- `757a75a5` [PIR #1047] Update tower-websocket test for pause/resume replay bracket +- `2f5e6351` [PIR #1047] docs: user-facing CHANGELOG entry for terminal-freeze fix +- `b05b73f2` [PIR #1047] Hoist replay-cap resolution out of constructor args; document the three distinct size limits +- `02ff5546` [PIR #1047] Remove byte caps: keep partial/replay whole for faithful TUI replay +- `7100a5a3` [PIR #1047] Re-assert terminal size after replay (Option A) + temporary diag logging +- `81c4a83b` [PIR #1047] Force a post-connect redraw nudge (mirror web client's SIGWINCH-on-connect); revert Option A +- `6d994f72` [PIR #1047] docs: changelog entry for terminal blank-on-open fix +- `cc7fbdc7` [PIR #1047] Strip temporary [#1047-diag] diagnostic logging + +(Plus the review/test-coverage commit carrying this file.) + +## Test Results + +- `pnpm build`: ✓ pass +- `pnpm --filter @cluesmith/codev test`: ✓ 3308 passed / 0 failed / 48 skipped +- `packages/vscode` adapter suite: ✓ 20 passed (including 5 storm-prevention + 2 repaint-nudge tests new for #1047); full VSCode suite previously green at 414. +- Manual verification (human, `dev-approval` gate, macOS VSCode against the worktree Tower): confirmed via the temporary `[#1047-diag]` client log that there is no connect→backpressure→reconnect storm, `seq` heartbeats stay clean, and — after the nudge landed — terminals **paint on open without a manual resize**. + +## Architecture Updates + +**COLD (`codev/resources/arch.md`)** — added a "Terminal reconnect/replay contract (#1047)" subsection under Shellper Lifecycle documenting the now-load-bearing behaviors: replay bracketed by `pause`/`resume` (excluded from client backpressure), `?resume=` delta reconnect, the deliberately-unbounded `partial` (faithful TUI replay) with CPU bounded by scan-only `pushData`, the client's connect-time forced redraw, and drop-not-reconnect under live overload. + +No HOT (`arch-critical.md`) change: these are subsystem-level protocol details, not a top-tier always-injected system-shape fact, and the hot file is at its cap. + +## Lessons Learned Updates + +**COLD (`codev/resources/lessons-learned.md`, Debugging section)** — three `[From 1047]` entries: (1) cross-client differential diagnosis — a symptom in one client but not another sharing the backend localizes the bug to the client layer (web rendered fine → the app *was* painting → the VSCode adapter was at fault), and a full-screen TUI only repaints on a size *change*; (2) a buffer bounded by newline count is unbounded for a newline-free stream, with the on-disk PTY logs ("longest run without `\n`") as the decisive evidence, and the caution that front-trimming a buffer corrupts TUI replay; (3) a backpressure relief valve that re-fetches the overflowing payload is an infinite loop — drop ephemeral output, don't reconnect. + +No HOT (`lessons-critical.md`) change: these are debugging recipes, not a top-tier cross-cutting rule, and the hot file is at its cap. + +## Things to Look At During PR Review + +- **The dropped byte caps (history).** Early commits added byte caps to `RingBuffer.partial` and `ShellperReplayBuffer`; they were **reverted** (`02ff5546`) after dev-approval testing showed front-trimming corrupts a full-screen TUI's replay (its alt-screen state lives in the cumulative stream from the alt-screen-enter onward). The freeze is fixed by *scan-only* `pushData` + the replay bracket, not by bounding the buffer. Net effect: `partial`/replay are functionally back to `main`'s faithful-but-unbounded behavior; the new `tower-server.ts` monitor logs partial size so unbounded growth (a known, issue-rated-minor trade-off) is observable. +- **The repaint nudge mechanism** (`terminal-adapter.ts`, `scheduleRepaintNudge`). It is a `rows-1` → `rows` size delta 500ms after connect, **gated** on `renderedSinceConnect` so a reconnect that already painted via replay doesn't reflow. The brief 1-row intermediate frame is intentional (guarantees a real `TIOCSWINSZ` delta even when the PTY is already at the target size). Confirm the gating and that it's cleared on close/reconnect (`resetStreamState`). +- **`#737` and the nudge are kept separate** (deliberate): `#737`'s on-open resize sets the correct size immediately (no flicker, common case); the nudge is the delayed backstop for the same-size no-op case. They're complementary, not redundant. +- **Single-pass CMAP caveat (PIR):** any `REQUEST_CHANGES` from the 3-way consult is not independently re-reviewed — please scrutinize the disposition at this gate. + +## How to Test Locally + +- **View diff**: VSCode sidebar → right-click builder `pir-1047` → **Review Diff**. +- **Run dev server**: VSCode sidebar → **Run Dev Server**, or `afx dev pir-1047`. +- **What to verify**: + - Open an architect/builder terminal running Claude's full-screen UI and let it redraw past ~1 MB of no-newline output; Tower CPU stays low and there is **no** connect→backpressure→reconnect storm (contrast the original ~14k-cycle capture). + - Open a terminal that previously blanked — it **paints on open** without a manual window resize. + - Reconnect a tab (close/reopen) → clean repaint; the reconnect URL carries `?resume=` (delta). + - Optional: leave Tower running for hours on a real workload and confirm CPU stays flat (the true end-to-end soak; cannot fit a gate session). + +## Flaky Tests + +None skipped. (Earlier full-suite runs showed environmental failures in `session-manager`/`adopt`/`hot-tier` integration tests when `dist/` and the copied skeleton were absent; a full `pnpm build` resolves them — they are not related to this change and pass after build.) From d10d01b2c165d6c7e927146656efd39b726c714f Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 16:15:14 +1000 Subject: [PATCH 26/31] chore(porch): 1047 record PR #1050 --- .../1047-tower-terminals-architects-bui/status.yaml | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/codev/projects/1047-tower-terminals-architects-bui/status.yaml b/codev/projects/1047-tower-terminals-architects-bui/status.yaml index 7d0f42620..85a6dd3f1 100644 --- a/codev/projects/1047-tower-terminals-architects-bui/status.yaml +++ b/codev/projects/1047-tower-terminals-architects-bui/status.yaml @@ -19,4 +19,9 @@ iteration: 1 build_complete: false history: [] started_at: '2026-06-14T22:12:29.323Z' -updated_at: '2026-06-15T06:08:47.532Z' +updated_at: '2026-06-15T06:15:14.784Z' +pr_history: + - phase: review + pr_number: 1050 + branch: builder/pir-1047 + created_at: '2026-06-15T06:15:14.783Z' From 02e212a9e72e7aa121640495d50b7e1c9059ddd4 Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 16:15:35 +1000 Subject: [PATCH 27/31] chore(porch): 1047 review build-complete --- .../projects/1047-tower-terminals-architects-bui/status.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/codev/projects/1047-tower-terminals-architects-bui/status.yaml b/codev/projects/1047-tower-terminals-architects-bui/status.yaml index 85a6dd3f1..a75c62c7b 100644 --- a/codev/projects/1047-tower-terminals-architects-bui/status.yaml +++ b/codev/projects/1047-tower-terminals-architects-bui/status.yaml @@ -16,10 +16,10 @@ gates: pr: status: pending iteration: 1 -build_complete: false +build_complete: true history: [] started_at: '2026-06-14T22:12:29.323Z' -updated_at: '2026-06-15T06:15:14.784Z' +updated_at: '2026-06-15T06:15:35.197Z' pr_history: - phase: review pr_number: 1050 From a3a4edb3221a5e97b742f94238eb7c5ca99ad029 Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 16:25:02 +1000 Subject: [PATCH 28/31] [PIR #1047] CMAP: document getSince no-newline limit + add Fix E (attachShellper idempotency) tests --- .../__tests__/pty-session-attach.test.ts | 79 +++++++++++++++++++ packages/codev/src/terminal/ring-buffer.ts | 15 +++- 2 files changed, 93 insertions(+), 1 deletion(-) create mode 100644 packages/codev/src/terminal/__tests__/pty-session-attach.test.ts diff --git a/packages/codev/src/terminal/__tests__/pty-session-attach.test.ts b/packages/codev/src/terminal/__tests__/pty-session-attach.test.ts new file mode 100644 index 000000000..ff6c6583d --- /dev/null +++ b/packages/codev/src/terminal/__tests__/pty-session-attach.test.ts @@ -0,0 +1,79 @@ +import { describe, it, expect } from 'vitest'; +import { EventEmitter } from 'node:events'; +import { PtySession, type PtySessionConfig } from '../pty-session.js'; +import type { IShellperClient } from '../shellper-client.js'; + +/** + * Fix E (#1047): attachShellper must be idempotent — re-attaching a session to + * a new shellper client must drop the listeners it installed on the previous + * client, so a reconnect can't accumulate duplicate `data` listeners that each + * re-run onPtyData for every PTY byte (the listener-leak the hardening targets). + */ + +function makeFakeClient(): IShellperClient { + const emitter = new EventEmitter() as unknown as IShellperClient & { _lastDataAt: number }; + // attachShellper reads client.lastDataAt and subscribes data/exit/close. + Object.defineProperty(emitter, 'lastDataAt', { get: () => Date.now() }); + return emitter; +} + +function makeSession(): PtySession { + const config: PtySessionConfig = { + id: 'sess-1', + command: '', + args: [], + cols: 80, + rows: 24, + cwd: '/tmp', + env: {}, + label: 'test', + logDir: '/tmp', + diskLogEnabled: false, // avoid touching the filesystem + }; + return new PtySession(config); +} + +describe('PtySession.attachShellper idempotency (#1047 Fix E)', () => { + it('removes listeners from the previous client on re-attach', () => { + const session = makeSession(); + const clientA = makeFakeClient(); + const clientB = makeFakeClient(); + + session.attachShellper(clientA, Buffer.alloc(0), 1234); + expect(clientA.listenerCount('data')).toBe(1); + + // Re-attach to a new client (a reconnect): A's listeners must be dropped. + session.attachShellper(clientB, Buffer.alloc(0), 5678); + expect(clientA.listenerCount('data')).toBe(0); + expect(clientA.listenerCount('exit')).toBe(0); + expect(clientA.listenerCount('close')).toBe(0); + expect(clientB.listenerCount('data')).toBe(1); + }); + + it('a data frame on the detached client no longer reaches the ring buffer', () => { + const session = makeSession(); + const clientA = makeFakeClient(); + const clientB = makeFakeClient(); + + session.attachShellper(clientA, Buffer.alloc(0), 1234); + session.attachShellper(clientB, Buffer.alloc(0), 5678); + + const seqBefore = session.ringBuffer.currentSeq; + clientA.emit('data', Buffer.from('stale\n', 'utf-8')); // from the old client + expect(session.ringBuffer.currentSeq).toBe(seqBefore); // ignored + + clientB.emit('data', Buffer.from('fresh\n', 'utf-8')); // from the live client + expect(session.ringBuffer.currentSeq).toBe(seqBefore + 1); + }); + + it('does not drop listeners when re-attaching the same client instance', () => { + const session = makeSession(); + const clientA = makeFakeClient(); + + session.attachShellper(clientA, Buffer.alloc(0), 1234); + session.attachShellper(clientA, Buffer.alloc(0), 1234); // same instance + // Guard only fires for a *different* client, so this is a no-op re-subscribe + // path; the session stays attached and functional. + expect(clientA.listenerCount('data')).toBeGreaterThanOrEqual(1); + }); +}); diff --git a/packages/codev/src/terminal/ring-buffer.ts b/packages/codev/src/terminal/ring-buffer.ts index b42746719..f66c26b01 100644 --- a/packages/codev/src/terminal/ring-buffer.ts +++ b/packages/codev/src/terminal/ring-buffer.ts @@ -72,7 +72,20 @@ export class RingBuffer { return result; } - /** Get lines starting from a given sequence number (for resume). */ + /** + * Get lines starting from a given sequence number (for resume). + * + * Note (#1047): `seq` advances only on completed (newline-terminated) lines. + * A full-screen TUI emits no newlines, so for such a session `seq` stays at + * whatever the last real line was and a client that is caught up to it gets + * `[]` here — the in-progress `partial` (the current screen) is NOT replayed + * on a delta resume. That gap is covered by the client's post-connect repaint + * nudge, which forces the app to redraw on (re)connect (see + * `terminal-adapter.ts`). True byte-granular resume for no-newline streams was + * considered and deliberately descoped (it would require a byte-addressable + * seq and breaks the existing line-based wire contract); the nudge makes it + * unnecessary for correctness. + */ getSince(sinceSeq: number): string[] { const linesAvailable = this.count; const oldestSeq = this.seq - linesAvailable + 1; From 49983f3574a9d108b5af81fd4a91af4a8c19cf58 Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 16:25:02 +1000 Subject: [PATCH 29/31] [PIR #1047] CMAP: correct review delta-resume wording + record rebuttals --- .../1047-review-iter1-rebuttals.md | 39 +++++++++++++++++++ .../1047-tower-terminals-architects-bui.md | 4 +- 2 files changed, 41 insertions(+), 2 deletions(-) create mode 100644 codev/projects/1047-tower-terminals-architects-bui/1047-review-iter1-rebuttals.md diff --git a/codev/projects/1047-tower-terminals-architects-bui/1047-review-iter1-rebuttals.md b/codev/projects/1047-tower-terminals-architects-bui/1047-review-iter1-rebuttals.md new file mode 100644 index 000000000..f4d1e8402 --- /dev/null +++ b/codev/projects/1047-tower-terminals-architects-bui/1047-review-iter1-rebuttals.md @@ -0,0 +1,39 @@ +# PIR #1047 — CMAP iteration-1 rebuttals / dispositions + +3-way consult verdicts: **Codex `REQUEST_CHANGES`**, **Claude `APPROVE`**, **Gemini failed** (see below). PIR is single-pass (`max_iterations: 1`) — these dispositions are not independently re-reviewed; the human at the `pr` gate is the final check. + +## Gemini — DISREGARD (not a real review) + +The Gemini consult misfired: its output is *"Hi! The workspace directory is currently empty, and your prompt was `--sandbox`. Could you please clarify…"* — it ran against an empty sandbox and never saw the diff. This is a known agy/Gemini-consult environment issue, not a verdict. Porch grep-defaulted it to REQUEST_CHANGES because the text lacks "APPROVE". No actionable content; disregarded. + +## Codex — REQUEST_CHANGES (HIGH) + +### C1. "Byte-addressable seq not implemented → no true delta resume for no-newline streams" — REBUTTED (deliberate descope), with a doc fix + +Accurate observation, but not a defect. `seq` stays line-based, so for a pure no-newline TUI stream `currentSeq` stays at the last real line and `connectUrl()` sends a full (faithful) replay rather than a `?resume=` delta. + +This was **deliberately descoped during dev-approval**. The plan's byte-addressable seq existed to make a *bounded* (byte-capped) buffer resumable. But the byte caps were reverted (see C2) because front-trimming corrupts a full-screen TUI's replay — its alt-screen state lives in the cumulative stream from the alt-screen-enter onward, so a trimmed buffer renders blank/garbled (the exact regression the human caught at the gate). With the buffer kept whole, a no-newline reconnect does a full faithful replay, which is correct — it's `main`'s behavior, just without the bandwidth optimization for the pathological case. Correctness on (re)connect is guaranteed by the **post-connect repaint nudge** (forces a redraw SIGWINCH), which is implemented and human-verified. + +Re-introducing byte-addressable seq + byte caps to chase the delta optimization would reintroduce the replay-corruption regression. Not warranted. + +**Fix applied (the legitimate part):** the implementation was correct but the *docs overstated* it. Corrected: +- Review Summary now scopes `?resume=` to "newline-bearing streams" and adds an explicit Scope note on the descope. +- `RingBuffer.getSince` now carries a comment documenting that `seq` advances only on completed lines, that a caught-up no-newline client gets `[]`, and that the repaint nudge covers it. + +### C2. "Shellper byte cap (Fix B) absent → unbounded restart replay for zero-newline sessions" — REBUTTED (deliberate descope) + +Accurate: `ShellperReplayBuffer` is line-bounded only, so a no-newline session's shellper replay grows unbounded and Tower replays it fully on restart. This was **reverted on purpose**, for the same reason as the RingBuffer byte cap: a byte cap front-trims the buffer, which corrupts the TUI replay Tower reseeds from on reconnect. Unlike the Tower side, the shellper side has **no O(n²) hot path** (it just appends chunks and counts newlines), so its absence does **not** reintroduce the freeze — the freeze is fixed by scan-only `pushData` + the replay bracket + drop-not-reconnect, none of which depend on Fix B. The remaining cost is memory growth for long no-newline sessions, which issue #1047 explicitly rated minor/orthogonal ("+76 MB over 10h… probably orthogonal; the CPU is the load-bearing signal"), and which is now **observable** via the new `tower-server.ts` partial-size monitor. Accepted trade-off; not re-adding. + +### C3. "Test coverage and review overstate what shipped" — PARTIALLY VALID, FIXED + +The review did describe `?resume=` delta reconnect as the delivered contract without the no-newline caveat. **Fixed** (C1's doc corrections). There is intentionally no "no-newline seq/getSince regression test" because there is no byte-seq behavior to pin — the line-based behavior is now documented in the `getSince` comment instead. + +## Claude — APPROVE (HIGH), with minor notes — ADDRESSED + +- **"No Fix E unit tests"** — **Fixed.** Added `packages/codev/src/terminal/__tests__/pty-session-attach.test.ts` (3 tests): re-attach drops the previous client's `data`/`exit`/`close` listeners; a stale frame on the detached client no longer reaches the ring buffer; same-instance re-attach is a no-op. +- **"Document the getSince no-newline limitation in the code comment"** — **Fixed** (same comment as C1). +- **"`docs/releases/UNRELEASED.md` not updated"** — Not applicable: the file does not exist on this branch (only `UNRELEASED.template.md`); it's a between-releases gap, not an omission. CHANGELOG.md carries both user-facing entries. + +## Net change from this iteration + +Docs/comments corrected for accuracy; one regression-test gap (Fix E) closed. No behavioral code change — the descoped items (byte caps, byte-addressable seq) are intentionally not re-added, as re-adding them would regress the human-verified faithful-replay behavior. The seq/Fix-B descope is the decision the human should confirm at the `pr` gate. diff --git a/codev/reviews/1047-tower-terminals-architects-bui.md b/codev/reviews/1047-tower-terminals-architects-bui.md index f35964440..b0bd207ec 100644 --- a/codev/reviews/1047-tower-terminals-architects-bui.md +++ b/codev/reviews/1047-tower-terminals-architects-bui.md @@ -4,7 +4,7 @@ Fixes #1047 ## Summary -Two client/server terminal bugs were fixed. (1) **Freeze:** a full-screen TUI (Claude's UI) redraws in place and emits almost no newlines, so Tower's reconnection buffers — bounded by *line count* — grew unbounded; the resulting multi-MB replay overflowed the VSCode client's 1 MB receive budget, and the client's "disconnect-and-reconnect for replay" response re-fetched the same oversized snapshot in a tight loop (~14k cycles/hour), pegging one CPU core and starving every terminal. (2) **Blank-on-open:** a freshly-attached terminal could render a blank pane until a manual window resize, because the connect-time resize could be a same-size no-op and a full-screen TUI only repaints on a size *change*. Tower now scans only new output (killing the O(n²) re-scan) and brackets replay with `pause`/`resume` so it is paced and excluded from backpressure; the client drops live output under overload instead of reconnecting, reconnects with a `?resume=` delta, and forces a guaranteed redraw shortly after connect (mirroring the web dashboard). +Two client/server terminal bugs were fixed. (1) **Freeze:** a full-screen TUI (Claude's UI) redraws in place and emits almost no newlines, so Tower's reconnection buffers — bounded by *line count* — grew unbounded; the resulting multi-MB replay overflowed the VSCode client's 1 MB receive budget, and the client's "disconnect-and-reconnect for replay" response re-fetched the same oversized snapshot in a tight loop (~14k cycles/hour), pegging one CPU core and starving every terminal. (2) **Blank-on-open:** a freshly-attached terminal could render a blank pane until a manual window resize, because the connect-time resize could be a same-size no-op and a full-screen TUI only repaints on a size *change*. Tower now scans only new output (killing the O(n²) re-scan) and brackets replay with `pause`/`resume` so it is paced and excluded from backpressure; the client drops live output under overload instead of reconnecting, reconnects with a `?resume=` delta *for newline-bearing streams*, and forces a guaranteed redraw shortly after connect (mirroring the web dashboard). **Scope note:** the plan's byte-addressable seq and shellper-side byte cap (Fix B) were deliberately descoped during dev-approval — byte-trimming the buffer corrupts a full-screen TUI's replay (its alt-screen state lives in the cumulative stream), so `partial`/replay are kept whole. Consequence: for a pure no-newline TUI stream, a reconnect falls back to full (faithful) replay rather than a delta, and the shellper restart-replay is line-bounded only (unbounded for no-newline output — a memory trade-off the issue itself rated minor/orthogonal, now observable via the new partial-size monitor). The post-connect repaint nudge covers correctness in both cases. ## Files Changed @@ -61,7 +61,7 @@ No HOT (`lessons-critical.md`) change: these are debugging recipes, not a top-ti - **The dropped byte caps (history).** Early commits added byte caps to `RingBuffer.partial` and `ShellperReplayBuffer`; they were **reverted** (`02ff5546`) after dev-approval testing showed front-trimming corrupts a full-screen TUI's replay (its alt-screen state lives in the cumulative stream from the alt-screen-enter onward). The freeze is fixed by *scan-only* `pushData` + the replay bracket, not by bounding the buffer. Net effect: `partial`/replay are functionally back to `main`'s faithful-but-unbounded behavior; the new `tower-server.ts` monitor logs partial size so unbounded growth (a known, issue-rated-minor trade-off) is observable. - **The repaint nudge mechanism** (`terminal-adapter.ts`, `scheduleRepaintNudge`). It is a `rows-1` → `rows` size delta 500ms after connect, **gated** on `renderedSinceConnect` so a reconnect that already painted via replay doesn't reflow. The brief 1-row intermediate frame is intentional (guarantees a real `TIOCSWINSZ` delta even when the PTY is already at the target size). Confirm the gating and that it's cleared on close/reconnect (`resetStreamState`). - **`#737` and the nudge are kept separate** (deliberate): `#737`'s on-open resize sets the correct size immediately (no flicker, common case); the nudge is the delayed backstop for the same-size no-op case. They're complementary, not redundant. -- **Single-pass CMAP caveat (PIR):** any `REQUEST_CHANGES` from the 3-way consult is not independently re-reviewed — please scrutinize the disposition at this gate. +- **CMAP disposition (single-pass — please scrutinize at this gate).** 3-way consult: Codex `REQUEST_CHANGES`, Claude `APPROVE`, Gemini failed (consult ran against an empty sandbox and returned a clarification prompt, not a review — disregarded). Codex's two substantive points are the *deliberately descoped* byte-addressable seq and shellper byte cap (Fix B): both were dropped during dev-approval because byte-trimming a buffer corrupts a full-screen TUI's faithful replay (the regression the human caught when terminals went blank) — re-adding them would reintroduce that regression, and the issue itself rated the resulting unbounded memory as minor/orthogonal. So those are **rebutted, not fixed** (full rationale in `codev/projects/1047-*/1047-review-iter1-rebuttals.md`). The one point both reviewers raised that *was* legitimate — the review/comments overstating the `?resume=` delta as universal — is **fixed**: the Summary now scopes it to newline-bearing streams, and `RingBuffer.getSince` carries a comment documenting the no-newline behavior + nudge mitigation. Claude's Fix E test gap is **closed** (`pty-session-attach.test.ts`). Because PIR will not re-review, the human at this gate is the final check on whether the seq/Fix-B descope is acceptable. ## How to Test Locally From 25efe9fcd6eba6886aedb608d2717b32faf3a90d Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 16:25:13 +1000 Subject: [PATCH 30/31] chore(porch): 1047 pr gate-requested --- .../projects/1047-tower-terminals-architects-bui/status.yaml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/codev/projects/1047-tower-terminals-architects-bui/status.yaml b/codev/projects/1047-tower-terminals-architects-bui/status.yaml index a75c62c7b..b0534da96 100644 --- a/codev/projects/1047-tower-terminals-architects-bui/status.yaml +++ b/codev/projects/1047-tower-terminals-architects-bui/status.yaml @@ -15,13 +15,15 @@ gates: approved_at: '2026-06-15T06:08:32.090Z' pr: status: pending + requested_at: '2026-06-15T06:25:13.811Z' iteration: 1 build_complete: true history: [] started_at: '2026-06-14T22:12:29.323Z' -updated_at: '2026-06-15T06:15:35.197Z' +updated_at: '2026-06-15T06:25:13.812Z' pr_history: - phase: review pr_number: 1050 branch: builder/pir-1047 created_at: '2026-06-15T06:15:14.783Z' +pr_ready_for_human: true From 3fce9e1737fed741842e748d820c225ee242d4dc Mon Sep 17 00:00:00 2001 From: Amr Elsayed Date: Mon, 15 Jun 2026 16:25:55 +1000 Subject: [PATCH 31/31] [PIR #1047] thread: review phase + CMAP disposition --- codev/state/pir-1047_thread.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/codev/state/pir-1047_thread.md b/codev/state/pir-1047_thread.md index de02f4fc5..71bb36311 100644 --- a/codev/state/pir-1047_thread.md +++ b/codev/state/pir-1047_thread.md @@ -98,3 +98,13 @@ Reverted Option A (resume-handler change) — log proved it never runs (empty bu ## Blank-on-open CONFIRMED FIXED (2026-06-15) User confirmed terminals now paint on open with the repaint-nudge fix. Stripped all [#1047-diag] diagnostic logging from terminal-adapter.ts (13 lines). Typecheck clean, adapter tests 18/18. The nudge + WS-open resize (#737) remain separate as decided. + +## Review phase + CMAP (2026-06-15) +PR #1050 opened (Fixes #1047), recorded with porch. 3-way consult (single-pass): +- codex=REQUEST_CHANGES: byte-addressable seq + shellper byte-cap (Fix B) dropped → no true delta resume for no-newline + unbounded shellper restart-replay; review overstated ?resume= contract. +- claude=APPROVE: deviations justified/documented; flagged Fix E lacks tests + getSince comment. +- gemini=FAILED (empty-sandbox misfire, not a real review). + +Disposition: REBUTTED the descopes (byte caps corrupt faithful TUI replay = the blank-screen regression; freeze fixed without them; memory rated minor + now monitored). FIXED the legitimate overlap (review scoped ?resume= to newline streams + getSince doc comment). CLOSED Fix E gap (pty-session-attach.test.ts, 3 tests). Rebuttals in codev/projects/1047-*/1047-review-iter1-rebuttals.md. + +pr gate PENDING. Architect notified (led with REQUEST_CHANGES + disposition). Waiting for human merge + pr-gate approval. Tests: core terminal 231 pass; vscode 416 pass; build green.