Merged
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-05-18
@@ -0,0 +1,50 @@
# agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38 (minimal / T1)

Branch: `agent/claude/halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38`

## Why

The codex-fleet currently holds ~10 GB of resident memory (16 codex CLIs
+ 258 node helper procs, observed via `ps -C codex -o rss`). Each codex
CLI is a native binary with ~200-400 MB of heap that does NOT shrink
while idle. Even when the plan is `plan-exhausted` and every worker is
in `sleep 60`, the native heap stays resident.
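To reproduce a figure like this, the RSS column from `ps` can be summed. The sketch below is a minimal helper, assuming Linux procps `ps` (which reports `rss` in KiB). The name `sum_rss_mib` is illustrative and not part of the fleet scripts.

```shell
#!/usr/bin/env bash
# sum_rss_mib: hypothetical helper, not part of scripts/codex-fleet/.
# Reads one RSS value (KiB) per line on stdin and prints the total in MiB.
sum_rss_mib() {
  awk '{ total += $1 } END { printf "%.0f\n", total / 1024 }'
}

# Example: total resident memory of all processes named "codex".
# The trailing "=" suppresses the header line so only numbers reach awk.
# ps -C codex -o rss= | sum_rss_mib
```

Running the same pipeline with `-C node` would give the helper-process share separately, since `-C codex` only matches processes whose command name is `codex`.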

## What

Three changes, all in `scripts/codex-fleet/`:

1. `codex-fleet-2.sh`: the worker count now comes from the `WORKER_COUNT`
environment variable (default **4**, previously hard-coded at **8**) and
drives the spawn loop. The value is clamped to between 1 and the length of
the `RESERVE_ACCOUNTS` array; set `WORKER_COUNT=8` for heavy plans.
2. `codex-fleet-2.sh`: `worker_cmd_for()` now sets
`NODE_OPTIONS=--max-old-space-size=400` in the command environment so that
any Node MCP-server child codex spawns has its V8 old-space heap capped at
400 MB. (codex itself is native; the flag does not apply to its own heap.)
3. `worker-prompt.md`: added an `empty_streak` counter to the worker
loop. After 5 consecutive `plan-exhausted` polls (~5 min idle at one poll
per `sleep 60`), the worker posts a Colony note and exits with status 0.
The supervisor respawns it when Colony reports new claimable work for the
account. Set `IDLE_EXIT_THRESHOLD=0` in a pane's environment to disable the
auto-exit for that pane.
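The idle-exit counter in item 3 can be sketched in shell as follows. This is only an illustration of the control flow: the real loop lives in the worker prompt, and `poll_has_work`, `idle_wait`, and `POLL_SLEEP` are hypothetical names introduced here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the idle-exit counter; not the fleet's worker code.

# poll_has_work stands in for the Colony readiness poll (assumption).
# It should return 0 when work is claimable, non-zero when the plan is exhausted.
poll_has_work() { return 1; }

# idle_wait returns 0 when the worker should exit for idleness,
# and 1 as soon as work becomes available.
idle_wait() {
  local threshold="${IDLE_EXIT_THRESHOLD:-5}"
  local pause="${POLL_SLEEP:-60}"   # assumed knob so the sketch can be tested quickly
  local empty_streak=0
  while true; do
    if poll_has_work; then
      return 1
    fi
    empty_streak=$(( empty_streak + 1 ))
    # A threshold of 0 disables the auto-exit, so the loop keeps polling.
    if (( threshold > 0 && empty_streak >= threshold )); then
      echo "idle-exit: empty_streak=${empty_streak}"
      return 0
    fi
    sleep "$pause"
  done
}
```

With the defaults (threshold 5, 60 s pause) the function gives up on the fifth consecutive empty poll, which is where the "~5 min idle" figure comes from.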

## Expected impact

| Metric | Before | After |
| --- | --- | --- |
| Workers spawned at bringup | 8 | 4 |
| Idle floor when plan exhausted | ~2 GB | near 0 |
| Active-work peak | ~10 GB | ~5 GB |
| Node MCP child heap cap | unbounded | 400 MB |
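The idle-floor row can be sanity-checked against the per-CLI heap range quoted under Why. The helper below is a rough sketch that multiplies the worker count by the 200-400 MB range; it covers only the native heap of live codex CLIs and ignores Node helpers. The name `heap_floor_mb` is illustrative.

```shell
#!/usr/bin/env bash
# heap_floor_mb: hypothetical helper for back-of-envelope estimates only.
# Usage: heap_floor_mb WORKERS [LOW_MB] [HIGH_MB]
heap_floor_mb() {
  local workers="$1" low="${2:-200}" high="${3:-400}"
  echo "$(( workers * low ))-$(( workers * high )) MB"
}

heap_floor_mb 8   # prints "1600-3200 MB", which brackets the ~2 GB "Before" floor
heap_floor_mb 4   # prints "800-1600 MB" while four workers are still alive
```

The "near 0" figure in the After column depends on the idle-exit, not on the smaller pool alone: four idle workers that never exit would still hold the second range.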

## Handoff

- Handoff: change=`agent-claude-halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38`; branch=`agent/claude/halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38`; scope=`codex-fleet-2.sh + worker-prompt.md`; action=`finish via PR`.

## Cleanup

- [ ] Run: `gx branch finish --branch agent/claude/halve-worker-pool-heap-cap-idle-worker-e-2026-05-18-14-38 --base main --via-pr --wait-for-merge --cleanup`
- [ ] Tear down + bring up the fleet to pick up the new defaults:
`bash scripts/codex-fleet/down.sh && bash scripts/codex-fleet/full-bringup.sh ...`
- [ ] Record PR URL + `MERGED` state in the completion handoff.
- [ ] Confirm sandbox worktree is gone (`git worktree list`, `git branch -a`).
21 changes: 18 additions & 3 deletions scripts/codex-fleet/codex-fleet-2.sh
@@ -143,20 +143,35 @@ RESERVE_ACCOUNTS=(
   admin-kollarrobert admin-mite bia-zazrifka fico-magnolia
   koncita-pipacs mesi-lebenyse recodee-mite ricsi-zazrifka
 )
+# Worker count is now parameterizable. Default halved from 8 → 4 to cut
+# the codex-fleet-2 RSS floor by ~50% (each codex CLI holds ~200-400 MB
+# of native heap that does not shrink while idle). Bump back via
+# `WORKER_COUNT=8 bash codex-fleet-2.sh ...` when a heavy plan needs
+# more parallel lanes; the array above caps the upper bound.
+WORKER_COUNT="${WORKER_COUNT:-4}"
+if (( WORKER_COUNT < 1 )); then WORKER_COUNT=1; fi
+if (( WORKER_COUNT > ${#RESERVE_ACCOUNTS[@]} )); then
+  WORKER_COUNT=${#RESERVE_ACCOUNTS[@]}
+fi
 worker_cmd_for() {
   local acct="$1"
   # Launch codex directly as the pane command (matches codex-fleet:overview's
   # pattern in scripts/codex-fleet/full-bringup.sh). codex inherits the pane's
   # TTY cleanly because we skip the bash-lc indirection. The guard wrapper
   # (codex-guard.sh) sees CODEX_GUARD_BYPASS=1 and execs the real codex.
-  printf 'env CODEX_GUARD_BYPASS=1 CODEX_HOME=/tmp/codex-fleet/%s CODEX_FLEET_AGENT_NAME=codex-fleet-2-%s CODEX_FLEET_ACCOUNT=%s CODEX_FLEET_SESSION=%s codex --dangerously-bypass-approvals-and-sandbox --add-dir /home/deadpool/Documents/codex-fleet --add-dir /home/deadpool/Documents/codex-fleetui' \
+  #
+  # NODE_OPTIONS=--max-old-space-size=400 caps any Node MCP-server child the
+  # codex binary spawns. codex itself is a native binary so the V8 flag
+  # does not apply to its own heap, but it keeps helper Node processes
+  # from growing unbounded.
+  printf 'env CODEX_GUARD_BYPASS=1 NODE_OPTIONS=--max-old-space-size=400 CODEX_HOME=/tmp/codex-fleet/%s CODEX_FLEET_AGENT_NAME=codex-fleet-2-%s CODEX_FLEET_ACCOUNT=%s CODEX_FLEET_SESSION=%s codex --dangerously-bypass-approvals-and-sandbox --add-dir /home/deadpool/Documents/codex-fleet --add-dir /home/deadpool/Documents/codex-fleetui' \
     "$acct" "$acct" "$acct" "$SESSION"
 }
-# Force a generous virtual size so 8 worker splits have room before the
+# Force a generous virtual size so the worker splits have room before the
 # kitty client attaches. tmux resizes to the client on attach anyway.
 tmux new-session -d -s "$SESSION" -x 274 -y 78 -n overview \
   "$(worker_cmd_for "${RESERVE_ACCOUNTS[0]}")"
-for i in 1 2 3 4 5 6 7; do
+for (( i = 1; i < WORKER_COUNT; i++ )); do
   acct="${RESERVE_ACCOUNTS[$i]}"
   tmux split-window -t "$SESSION:overview" "$(worker_cmd_for "$acct")" >/dev/null 2>&1 || true
   tmux select-layout -t "$SESSION:overview" tiled >/dev/null
17 changes: 16 additions & 1 deletion scripts/codex-fleet/worker-prompt.md
@@ -89,18 +89,33 @@ the file-claim, gx, or PR-merge contracts in this prompt.
 ## Loop

 ```
+1. empty_streak = 0 # tracked per-pane; reset whenever ready.ready is non-empty
 2. ready = mcp__colony__task_ready_for_agent({ agent: $CODEX_FLEET_AGENT_NAME, limit: 1 })
 3. if ready.ready is empty:
+     empty_streak += 1
+     if IDLE_EXIT_THRESHOLD > 0 and empty_streak >= IDLE_EXIT_THRESHOLD (default 5, i.e. ~5 minutes idle; 0 disables):
+       task_post(kind:'note', content:'idle-exit: empty_streak=<n>; supervisor will respawn on demand')
+       exit 0 # native heap is reclaimed; supervisor watcher respawns the pane
+              # only when Colony reports new claimable work for this account.
      if ready.next_action contains "rescue" or ready.next_tool == "rescue_stranded_scan":
        sleep 60 # claim-release-supervisor daemon owns rescue; do not loop on it
      else:
        sleep 60
      goto 2
-4. task = ready.ready[0]
+4. empty_streak = 0
+   task = ready.ready[0]
 ```

 Then preflight, claim, work, report. Sequence below.
+
+**Why the idle-exit.** Each codex CLI holds ~200-400 MB of native heap that
+does not shrink while idle. Leaving 8 workers spinning at `sleep 60` keeps
+~2-3 GB of RSS resident even when no plan is claimable. Exiting after 5
+consecutive empty polls (~5 min idle) drops the floor to active workers
+only; the supervisor / `claim-release-supervisor.sh` respawns the pane on
+the next Colony work signal. Override with `IDLE_EXIT_THRESHOLD=0` in the
+pane env to disable the auto-exit for that one pane.

 ### Tier + specialty gate (REQUIRED before preflight)

 Read once at boot: `tier=$CODEX_FLEET_TIER` (default `high`),