
scaffold: fleet-dispatch-fixes — 6 findings from marketing-waves telemetry #189

Merged
NagyVikt merged 4 commits into main from agent/claude/cfui-dispatch-improvements-zzz-2026-05-1-2026-05-18-14-03
May 18, 2026



@NagyVikt NagyVikt commented May 18, 2026

Summary

Scaffolds the OpenSpec change + 7-subtask plan workspace for fleet-dispatch-fixes-2026-05-18. Captures dispatch defects surfaced by real telemetry from the 2026-05-18 fleet bringups (recodee + codex-fleetui) plus a NEW finding F7 surfaced during the live FLEET_ID=3 bringup of this very plan.

The 7 findings

| # | Finding | Evidence |
|---|---------|----------|
| F1 | Stale dead panes silent in overview | Pane is dead (signal 15) on 5+ panes; never surfaced |
| F2 | cap-probe cache outlives quota recovery | 5/6 healthy → 8/8 on fresh probe 5min later |
| F3 | wake-prompt window blank on bringup | Workers idle at default Codex placeholders |
| F4 | plan-watcher rejects depends_on without --allow-waves | Silent fallback to next plan; observed trading-edge-foundations dispatched in place of priority plan |
| F5 | force-claim "not in a mode" on non-idle panes | 9× per tick; dispatch silently dropped |
| F6 | Codex auto-submit not firing on send-keys | Context drops but Colony shows 0 claims |
| F7 | Codex first-launch prompts block all workers | All 8 of FLEET_ID=3 stalled on "Do you trust …" → "External agent config" → "Press enter to continue"; operator had to click each pane until I shipped codex-first-launch-supervisor.sh (already seeded in this branch, auto-drains all three prompts in parallel; 6/8 cleared in 30s) |
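
F1 comes down to a liveness check that the overview never ran. A minimal sketch of such a check follows, assuming the fleet runs inside tmux (the PR mentions send-keys and capture-pane but never names tmux outright) and that panes stay visible after exit. The helper name `list_dead_panes` and the session name in the comment are illustrative only; they are not taken from show-fleet.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "<pane_id> <pane_dead>" pairs on stdin and
# print the ids of the panes whose dead flag is 1.
list_dead_panes() {
  awk '$2 == 1 { print $1 }'
}

# Assumed invocation against a live session (session name is illustrative):
#   tmux list-panes -s -t codex-fleet-3 -F '#{pane_id} #{pane_dead}' | list_dead_panes
```

Keeping the filter separate from the tmux call means it can be exercised with canned input, without a running fleet.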

Subtask split (disjoint file_scope, all parallel-ready)

| Subtask | Touches | Hint |
|---------|---------|------|
| sub-0 (F1) | show-fleet.sh, docs/fleet-telemetry-cases.md | doc_work |
| sub-1 (F2) | cap-probe.sh, cap-probe-cache.sh | test_work |
| sub-2 (F3+F7 wire-in) | full-bringup.sh (both auto-wake AND auto-bypass at tail) | api_work |
| sub-3 (F4) | plan-watcher.sh | frontend_work |
| sub-4 (F5) | force-claim.sh | frontend_work |
| sub-5 (F6) | test/codex-auto-submit-test.sh | test_work |
| sub-6 (F7 test) | test/first-launch-bypass-test.sh | test_work |

The F7 supervisor script itself (scripts/codex-fleet/codex-first-launch-supervisor.sh) is already in this branch — proven working live on the FLEET_ID=3 workers. Sub-2 just wires it into the bringup tail.

Acceptance gate

After all 7 subtasks land, a clean re-run of full-bringup.sh --plan-slug fleet-dispatch-fixes-2026-05-18 --n 4 --auto-fleet-id --no-cap-cache must show:

  1. Zero panes stuck on first-launch prompts within 30s of DONE.
  2. >=4 Colony claims within 90s of DONE (vs current 0).
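
Both gate conditions are deadlines ("within 30s", "within 90s"), so they can be checked with a generic poll-until-deadline loop. The sketch below is one possible shape, not the gate the fleet actually uses. `wait_until` and the two probe names in the comments are hypothetical, and the PR does not show how stuck panes or Colony claims are counted.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or TIMEOUT_SECONDS have elapsed.
# Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  while true; do
    if "$@"; then return 0; fi
    (( SECONDS >= deadline )) && return 1
    sleep "$interval"
  done
}

# Hypothetical probes, one per gate condition:
#   wait_until 30 2 no_panes_stuck         # condition 1
#   wait_until 90 5 at_least_four_claims   # condition 2
```

The probe is always tried at least once, so a condition that already holds at DONE passes immediately.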

Operator layout

Three kitty windows on the operator desktop:

  • fleet-dispatch-fixes (codex-fleet-3) — the 8-worker overview
  • fleet-dispatch-fixes ticker (fleet-ticker-3) — force-claim, plan-watcher, cap-swap, wake-prompt
  • operator shell (codex-fleetui) — empty shell at the repo root for ad-hoc commands

Test plan

  • openspec validate <change-id> --type change --strict passes
  • lib/plan-validator.sh openspec/plans/fleet-dispatch-fixes-2026-05-18/plan.json returns ok:true
  • F7 supervisor verified live: 6/8 workers auto-drained from first-launch prompts
  • Each remaining subtask lands its own focused test under scripts/codex-fleet/test/
  • Integration gate passes (zero stuck panes, >=4 claims in 90s)

Draft until the fleet has produced its work.

🤖 Generated with Claude Code

NagyVikt and others added 4 commits May 18, 2026 14:06
…c change

Captures 6 dispatch-path defects surfaced by real telemetry from the
2026-05-18 marketing-content-waves fleet against recodee:

F1 — stale dead panes lingering silently in overview chrome
F2 — cap-probe cache outliving quota recovery (5/6 healthy → 8/8 on
     fresh probe 5min later)
F3 — wake-prompt window blank on bringup; workers idle at default Codex
     placeholders ("Implement {feature}", "Find and fix a bug in @filename")
F4 — plan-watcher re-validates without --allow-waves, silently falls back
     to next-priority plan when our plan has depends_on (observed:
     trading-edge-foundations dispatched while our priority plan skipped)
F5 — force-claim "not in a mode" on non-idle Codex panes, drops dispatch
     with no retry/backoff
F6 — Codex auto-submit not firing on send-keys: context drops (text
     arrived) but no Colony claim recorded — likely needs different
     terminator key sequence
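
For F2, the underlying issue is a cached probe result being trusted after it has gone stale. One way to bound that, sketched here under the assumption that the cache is a plain file whose modification time marks the last probe, is an explicit TTL check. `cache_is_fresh` is a hypothetical name; the real layout of cap-probe-cache.sh is not shown in this PR.

```shell
#!/usr/bin/env bash
# cache_is_fresh FILE TTL_SECONDS
# Succeeds only if FILE exists and was modified less than TTL_SECONDS ago.
cache_is_fresh() {
  local file=$1 ttl=$2 mtime now
  [ -f "$file" ] || return 1
  # Try GNU stat first, then fall back to BSD/macOS stat.
  mtime=$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file") || return 1
  now=$(date +%s)
  (( now - mtime < ttl ))
}

# Assumed usage (variable and TTL value are illustrative):
#   if cache_is_fresh "$cache_file" 60; then reuse the cache; else re-probe; fi
```

A TTL shorter than the observed 5-minute recovery window would have forced a fresh probe in the case described above.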

Plan workspace at openspec/plans/fleet-dispatch-fixes-2026-05-18/ has 6
parallel-ready subtasks (disjoint file_scope, no depends_on so plan-
watcher accepts without --allow-waves until F4 lands). Each subtask
ships with a focused test under scripts/codex-fleet/test/.

This PR is the scaffold; implementation comes from the fleet itself
(separate per-subtask PRs from the fleet, then one squashed integration
PR per the OpenSpec change tasks.md verification gates).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Live FLEET_ID=3 bringup surfaced bug #7: per-account CODEX_HOMEs under
/tmp/codex-fleet/<acct> trigger Codex CLI first-launch prompts
("Do you trust …", "External agent config detected", "Press enter to
continue") that block worker bootstrap before the input box exists.
All 8 workers stalled; operator had to click each pane.

This commit:
- Seeds scripts/codex-fleet/codex-first-launch-supervisor.sh that
  polls each worker pane and auto-answers the three prompts. Verified
  working live (6/8 panes drained automatically; remaining 2 need
  slight backoff tuning).
- Expands plan subtask sub-2 (F3) to wire BOTH auto-wake and
  auto-bypass into full-bringup.sh tail, gated by
  CODEX_FLEET_AUTO_WAKE / CODEX_FLEET_AUTO_BYPASS env (default 1).
  Order: auto-bypass runs before auto-wake.
- Narrows sub-6 (F7) to ship a smoke test that asserts no panes
  remain stuck on first-launch prompts within 30s of DONE.
- Adds matching acceptance criterion + proposal narrative for F7.
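
The gating and ordering described in the sub-2 bullet can be sketched as follows. Only the two environment variable names, their default of 1, and the bypass-before-wake order come from this commit message; the function names are hypothetical stand-ins, and the real full-bringup.sh tail may look different.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the real steps.
run_auto_bypass() { echo "auto-bypass"; }
run_auto_wake()   { echo "auto-wake"; }

bringup_tail() {
  # Both gates default to 1 (enabled) when the variable is unset or empty.
  if [ "${CODEX_FLEET_AUTO_BYPASS:-1}" = "1" ]; then
    run_auto_bypass
  fi
  # Wake runs second: the first-launch prompts block the worker before the
  # input box exists, so they have to be cleared first.
  if [ "${CODEX_FLEET_AUTO_WAKE:-1}" = "1" ]; then
    run_auto_wake
  fi
}
```

Setting either variable to 0 disables that step without affecting the other, for example `CODEX_FLEET_AUTO_BYPASS=0 bringup_tail`.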

The two-pane operator layout (one kitty with the agents, one empty
operator shell) is now the default — see PR description.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Live run showed 6/8 panes drained at 1.5s; remaining 2 needed
~9-15s total but only got 7.5s of attempts (5 rounds × 1.5s).
2.5s × 5 = 12.5s window catches slow Codex bootstraps without
making fast cases noticeably slower.
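
The window arithmetic above is simply rounds multiplied by interval. A small helper (hypothetical name `retry_window`) makes the before and after values easy to reproduce:

```shell
#!/usr/bin/env bash
# retry_window ROUNDS INTERVAL_SECONDS
# Prints the total time covered by ROUNDS attempts spaced INTERVAL_SECONDS apart.
retry_window() {
  LC_ALL=C awk -v r="$1" -v i="$2" 'BEGIN { printf "%.1f\n", r * i }'
}

retry_window 5 1.5   # old setting, prints 7.5
retry_window 5 2.5   # new setting, prints 12.5
```

Note that 12.5s sits inside the reported 9-15s range rather than above it, so a pane at the slow end of that range could still outlast the window.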

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Workers' Codex CLI echoes the prior menu text into tool-call history
once they reach the worker loop. The supervisor's grep against the
full scrollback (-S -100) saw the menu in history and falsely flagged
a fully-drained pane as still stuck.

Switching to bare `capture-pane -p` (live screen only) eliminates
the false positive while still catching live menus.

Also add `1` + Enter combo for the External-agent menu — some Codex
builds advance on bare digit, others need Enter to confirm; sending
both is harmless on already-advanced panes.
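
The difference between the two capture modes can be illustrated with a sketch like the one below. It assumes tmux, and it matches only the three prompt strings quoted in this PR. The function name is hypothetical, and the actual patterns in codex-first-launch-supervisor.sh are not shown here.

```shell
#!/usr/bin/env bash
# Reads pane text on stdin and succeeds if any of the three first-launch
# prompt strings quoted in this PR is present.
screen_has_first_launch_prompt() {
  grep -qE 'Do you trust|External agent config|Press enter to continue'
}

# Live screen only (the fix): without -S, scrollback is not included.
#   tmux capture-pane -p -t "$pane" | screen_has_first_launch_prompt
# Previous behaviour: -S -100 also pulls in 100 lines of history, where the
# echoed menu text produced the false positive.
#   tmux capture-pane -p -S -100 -t "$pane" | screen_has_first_launch_prompt
# Digit followed by Enter for the External-agent menu:
#   tmux send-keys -t "$pane" 1 Enter
```

Because the matcher only reads stdin, it can be checked against saved pane dumps as well as live panes.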

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@NagyVikt NagyVikt marked this pull request as ready for review May 18, 2026 12:36
@NagyVikt NagyVikt merged commit 9a3774e into main May 18, 2026
3 checks passed
NagyVikt added a commit that referenced this pull request May 18, 2026
Implements all 6+1 dispatch-path fixes from PR #189 (scaffold) with live evidence from the 2026-05-18 fleet runs.

F1 dead-pane surfacing · F2 cap-probe TTL hardening · F3 auto-wake at bringup · F4 plan-watcher --allow-waves · F5 force-claim worker-ready gate · F6 auto-submit smoke test · F7 first-launch supervisor wired in (smoke test PASSES live).