From 27ef19c47d77097e02552de2d8f98ba737a95e93 Mon Sep 17 00:00:00 2001
From: NagyVikt
Date: Mon, 18 May 2026 14:42:22 +0200
Subject: [PATCH] feat(codex-fleet): halve worker pool default + cap Node MCP
 children + idle worker auto-exit
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three changes, all in scripts/codex-fleet/, addressing the ~10 GB resident
memory floor observed when the fleet is up (16 codex CLIs + 258 node
helpers; each codex CLI holds 200-400 MB of native heap that does not
shrink while idle).

codex-fleet-2.sh
- New WORKER_COUNT env (default 4, was hardcoded 8). The RESERVE_ACCOUNTS
  array is unchanged and remains the upper bound: WORKER_COUNT is clamped
  to 1..8 (the array length) and the spawn loop iterates
  0..WORKER_COUNT-1. Bump back via WORKER_COUNT=8 when a heavy plan needs
  more parallel lanes.
- worker_cmd_for() now sets NODE_OPTIONS=--max-old-space-size=400 in the
  pane command's env so the V8 old-space heap of any Node MCP-server
  child that codex spawns is capped. codex itself is native; this only
  affects Node helpers.

worker-prompt.md
- Added empty_streak counter to the worker loop. After
  IDLE_EXIT_THRESHOLD (default 5) consecutive empty
  task_ready_for_agent polls (~5 min idle), the worker posts a Colony
  note and exits 0. Supervisor respawns it when Colony reports new
  claimable work. Set IDLE_EXIT_THRESHOLD=0 in a pane's env to disable
  the auto-exit for that pane.

Expected impact
- workers at bringup: 8 → 4
- idle floor when plan exhausted: ~2 GB → ~0
- active-work peak: ~10 GB → ~5 GB
- node MCP child old-space heap cap: V8 default → 400 MB

To activate, tear down + bring up the fleet:
  bash scripts/codex-fleet/down.sh
  bash scripts/codex-fleet/full-bringup.sh ...
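
To sanity-check the expected-impact numbers after a bringup, RSS can be
summed per command name. The snippet below is a minimal sketch, assuming
procps `ps` on Linux (`-C` selects by command name, `rss=` prints KiB with
no header). `sum_kib_to_mib` and `rss_mib_for` are illustrative names and
are not part of scripts/codex-fleet/.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not fleet code.

# Read one KiB value per line on stdin and print the total in whole MiB.
sum_kib_to_mib() {
  awk '{ total += $1 } END { printf "%d\n", total / 1024 }'
}

# Total RSS in MiB for every process whose command name matches $1.
# Assumes procps ps; prints 0 when no process matches.
rss_mib_for() {
  ps -C "$1" -o rss= | sum_kib_to_mib
}

# Example (run once before down.sh and once after full-bringup.sh):
#   echo "codex: $(rss_mib_for codex) MiB, node: $(rss_mib_for node) MiB"
```

Comparing the two readings shows whether the codex and node totals moved
in the direction listed under "Expected impact".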
---
 .../.openspec.yaml                   |  2 +
 .../notes.md                         | 50 +++++++++++++++++++
 scripts/codex-fleet/codex-fleet-2.sh | 21 ++++++--
 scripts/codex-fleet/worker-prompt.md | 17 ++++++-
 4 files changed, 86 insertions(+), 4 deletions(-)
 create mode 100644 openspec/changes/agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38/.openspec.yaml
 create mode 100644 openspec/changes/agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38/notes.md

diff --git a/openspec/changes/agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38/.openspec.yaml b/openspec/changes/agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38/.openspec.yaml
new file mode 100644
index 0000000..231e3ab
--- /dev/null
+++ b/openspec/changes/agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38/.openspec.yaml
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-05-18
diff --git a/openspec/changes/agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38/notes.md b/openspec/changes/agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38/notes.md
new file mode 100644
index 0000000..dfd8b71
--- /dev/null
+++ b/openspec/changes/agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38/notes.md
@@ -0,0 +1,50 @@
+# agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38 (minimal / T1)
+
+Branch: `agent/claude/halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38`
+
+## Why
+
+The codex-fleet currently holds ~10 GB of resident memory (16 codex CLIs
++ 258 node helper procs, observed via `ps -C codex -o rss`). Each codex
+CLI is a native binary with ~200-400 MB of heap that does NOT shrink
+while idle. Even when the plan is `plan-exhausted` and every worker is
+in `sleep 60`, the native heap stays resident.
+
+## What
+
+Three changes, all in `scripts/codex-fleet/`:
+
+1. `codex-fleet-2.sh`: worker count now comes from the `WORKER_COUNT` env
+   (default **4**, was hardcoded **8**) and drives the spawn loop. The
+   `RESERVE_ACCOUNTS` array stays as the upper bound; set
+   `WORKER_COUNT=8` for heavy plans.
+2. `codex-fleet-2.sh`: `worker_cmd_for()` now sets
+   `NODE_OPTIONS=--max-old-space-size=400` so any Node MCP-server child
+   codex spawns is capped. (codex itself is native; the flag does not
+   apply to its own heap.)
+3. `worker-prompt.md`: added an `empty_streak` counter to the worker
+   loop. After `IDLE_EXIT_THRESHOLD` (default 5) consecutive empty polls
+   (~5 min idle), the worker posts a Colony note and exits with status 0.
+   The supervisor respawns it when Colony reports new claimable work for
+   the account. Set `IDLE_EXIT_THRESHOLD=0` per pane to disable the exit.
+
+## Expected impact
+
+| Metric | Before | After |
+| --- | --- | --- |
+| Workers spawned at bringup | 8 | 4 |
+| Idle floor when plan exhausted | ~2 GB | near 0 |
+| Active-work peak | ~10 GB | ~5 GB |
+| Node MCP child old-space heap cap | V8 default | 400 MB |
+
+## Handoff
+
+- Handoff: change=`agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38`; branch=`agent/claude/halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38`; scope=`codex-fleet-2.sh + worker-prompt.md`; action=`finish via PR`.
+
+## Cleanup
+
+- [ ] Run: `gx branch finish --branch agent/claude/halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38 --base main --via-pr --wait-for-merge --cleanup`
+- [ ] Tear down + bring up the fleet to pick up the new defaults:
+  `bash scripts/codex-fleet/down.sh && bash scripts/codex-fleet/full-bringup.sh ...`
+- [ ] Record PR URL + `MERGED` state in the completion handoff.
+- [ ] Confirm sandbox worktree is gone (`git worktree list`, `git branch -a`).
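
Reviewer note on the idle-exit rule described in notes.md: the counting
logic can be sketched in shell as below. This is a minimal illustration
under stated assumptions, not the worker's actual implementation (the
worker follows the prompt, not a script). `should_idle_exit` and
`simulate_polls` are hypothetical names, and treating a threshold of 0 as
"never exit" follows the documented `IDLE_EXIT_THRESHOLD=0` override.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not fleet code.

# should_idle_exit STREAK [THRESHOLD]
# Succeeds (status 0) when the worker should exit. A threshold of 0
# disables the auto-exit, matching the documented override.
should_idle_exit() {
  local streak="$1" threshold="${2:-5}"
  (( threshold > 0 && streak >= threshold ))
}

# simulate_polls THRESHOLD RESULT...
# Each RESULT is "empty" or "work". Prints the 1-based poll at which the
# worker would exit, or "running" if it never reaches the threshold.
simulate_polls() {
  local threshold="$1"; shift
  local streak=0 n=0 result
  for result in "$@"; do
    n=$(( n + 1 ))
    if [[ "$result" == "empty" ]]; then
      streak=$(( streak + 1 ))
      if should_idle_exit "$streak" "$threshold"; then
        echo "exit at poll $n"
        return 0
      fi
    else
      streak=0   # any claimable task resets the streak
    fi
  done
  echo "running"
}
```

For example, `simulate_polls 5 empty empty work empty empty` stays
running because the claimed task resets the streak before it reaches 5.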
diff --git a/scripts/codex-fleet/codex-fleet-2.sh b/scripts/codex-fleet/codex-fleet-2.sh
index 612851a..ac0f6a8 100755
--- a/scripts/codex-fleet/codex-fleet-2.sh
+++ b/scripts/codex-fleet/codex-fleet-2.sh
@@ -143,20 +143,35 @@ RESERVE_ACCOUNTS=(
   admin-kollarrobert admin-mite bia-zazrifka fico-magnolia
   koncita-pipacs mesi-lebenyse recodee-mite ricsi-zazrifka
 )
+# Worker count is now parameterizable. Default halved from 8 → 4 to cut
+# the codex-fleet-2 RSS floor by ~50% (each codex CLI holds ~200-400 MB
+# of native heap that does not shrink while idle). Bump back via
+# `WORKER_COUNT=8 bash codex-fleet-2.sh ...` when a heavy plan needs
+# more parallel lanes; the array above caps the upper bound.
+WORKER_COUNT="${WORKER_COUNT:-4}"
+if (( WORKER_COUNT < 1 )); then WORKER_COUNT=1; fi
+if (( WORKER_COUNT > ${#RESERVE_ACCOUNTS[@]} )); then
+  WORKER_COUNT=${#RESERVE_ACCOUNTS[@]}
+fi
 worker_cmd_for() {
   local acct="$1"
   # Launch codex directly as the pane command (matches codex-fleet:overview's
   # pattern in scripts/codex-fleet/full-bringup.sh). codex inherits the pane's
   # TTY cleanly because we skip the bash-lc indirection. The guard wrapper
   # (codex-guard.sh) sees CODEX_GUARD_BYPASS=1 and execs the real codex.
-  printf 'env CODEX_GUARD_BYPASS=1 CODEX_HOME=/tmp/codex-fleet/%s CODEX_FLEET_AGENT_NAME=codex-fleet-2-%s CODEX_FLEET_ACCOUNT=%s CODEX_FLEET_SESSION=%s codex --dangerously-bypass-approvals-and-sandbox --add-dir /home/deadpool/Documents/codex-fleet --add-dir /home/deadpool/Documents/codex-fleetui' \
+  #
+  # NODE_OPTIONS=--max-old-space-size=400 caps any Node MCP-server child the
+  # codex binary spawns. codex itself is a native binary so the V8 flag
+  # does not apply to its own heap, but it keeps helper Node processes
+  # from growing unbounded.
+  printf 'env CODEX_GUARD_BYPASS=1 NODE_OPTIONS=--max-old-space-size=400 CODEX_HOME=/tmp/codex-fleet/%s CODEX_FLEET_AGENT_NAME=codex-fleet-2-%s CODEX_FLEET_ACCOUNT=%s CODEX_FLEET_SESSION=%s codex --dangerously-bypass-approvals-and-sandbox --add-dir /home/deadpool/Documents/codex-fleet --add-dir /home/deadpool/Documents/codex-fleetui' \
     "$acct" "$acct" "$acct" "$SESSION"
 }
-# Force a generous virtual size so 8 worker splits have room before the
+# Force a generous virtual size so the worker splits have room before the
 # kitty client attaches. tmux resizes to the client on attach anyway.
 tmux new-session -d -s "$SESSION" -x 274 -y 78 -n overview \
   "$(worker_cmd_for "${RESERVE_ACCOUNTS[0]}")"
-for i in 1 2 3 4 5 6 7; do
+for (( i = 1; i < WORKER_COUNT; i++ )); do
   acct="${RESERVE_ACCOUNTS[$i]}"
   tmux split-window -t "$SESSION:overview" "$(worker_cmd_for "$acct")" >/dev/null 2>&1 || true
   tmux select-layout -t "$SESSION:overview" tiled >/dev/null
diff --git a/scripts/codex-fleet/worker-prompt.md b/scripts/codex-fleet/worker-prompt.md
index bc41bd0..e21aff8 100644
--- a/scripts/codex-fleet/worker-prompt.md
+++ b/scripts/codex-fleet/worker-prompt.md
@@ -89,18 +89,33 @@ the file-claim, gx, or PR-merge contracts in this prompt.
 ## Loop
 
 ```
+1. empty_streak = 0   # tracked per-pane; reset whenever ready.ready is non-empty
 2. ready = mcp__colony__task_ready_for_agent({ agent: $CODEX_FLEET_AGENT_NAME, limit: 1 })
 3. if ready.ready is empty:
+     empty_streak += 1
+     if IDLE_EXIT_THRESHOLD > 0 and empty_streak >= IDLE_EXIT_THRESHOLD (default 5, i.e. ~5 minutes idle):
+       task_post(kind:'note', content:'idle-exit: empty_streak=; supervisor will respawn on demand')
+       exit 0   # native heap is reclaimed; supervisor watcher respawns the pane
+                # only when Colony reports new claimable work for this account.
      if ready.next_action contains "rescue" or ready.next_tool == "rescue_stranded_scan":
        sleep 60   # claim-release-supervisor daemon owns rescue; do not loop on it
      else:
        sleep 60
      goto 2
-4. task = ready.ready[0]
+4. empty_streak = 0
+   task = ready.ready[0]
 ```
 
 Then preflight, claim, work, report. Sequence below.
 
+**Why the idle-exit.** Each codex CLI holds ~200-400 MB of native heap that
+does not shrink while idle. Leaving 8 workers spinning at `sleep 60` keeps
+~2-3 GB of RSS resident even when no plan is claimable. Exiting after 5
+consecutive empty polls (~5 min idle) drops the floor to active workers
+only; the supervisor / `claim-release-supervisor.sh` respawns the pane on
+the next Colony work signal. Set `IDLE_EXIT_THRESHOLD=0` in the pane env
+to disable the auto-exit for that one pane (0 means never exit on idle).
+
 ### Tier + specialty gate (REQUIRED before preflight)
 
 Read once at boot: `tier=$CODEX_FLEET_TIER` (default `high`),