Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
100 changes: 100 additions & 0 deletions docs/fleet-telemetry-cases.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
# Fleet Telemetry Cases

Live cases surfaced by `/tmp/codex-fleet-telemetry-*.jsonl` and the in-process
supervisors during real bringups. Each entry documents the symptom, the
detection signal, and the fix that addresses it.

## F1 — Dead panes silent in overview

**Symptom (live 2026-05-18):** `Pane is dead (signal 15, Mon May 18 11:43:27 2026)`
on 5+ panes of `codex-fleet` session. Operator only noticed by scrolling into
each pane manually; the overview chrome rendered them as if alive.

**Detection signal:**
```jsonl
{"kind":"pane","pane_id":"%16","last_line":"Pane is dead (signal 15, Mon May 18 11:43:27 2026)","blocked":0,"stall_secs":0}
```

**Fix:** `scripts/codex-fleet/show-fleet.sh:dead_panes_report()` reads
`tmux list-panes -F '#{pane_dead}'` and emits a JSON summary on stderr.
Markers under `/tmp/claude-viz/dead-pane-firstseen/` track first-seen
timestamps so we can alert at age >60s.

---

## F2 — Cap-probe cache outlived quota recovery

**Symptom (live 2026-05-18):** First `full-bringup.sh` found 5/6 healthy
accounts; a fresh `--no-cap-cache` re-run ~5min later found 8/8 healthy.
The 300s default `CACHE_TTL_HEALTHY` outlived the actual quota window
during a normal fleet bringup.

**Fix:** `scripts/codex-fleet/cap-probe.sh` lowers `CACHE_TTL_HEALTHY` default
to 60s, adds `CODEX_FLEET_CAP_CACHE_TTL` env override, and zeroes the TTL
when `/tmp/claude-viz/bringup-failure.marker` exists.

---

## F3 + F7 — wake-prompt and trust-prompt never fire on bringup

**Symptom (live 2026-05-18):** `fleet-ticker-2:wake-prompt` window blank
after bringup; 8 workers in `codex-fleet-2` stuck at default Codex
placeholders (`"Implement {feature}"`). Separately, FLEET_ID=3's 8 workers
each blocked on `Do you trust the contents of this directory?` →
`External agent config detected` → `Press enter to continue`.

**Fix:**
- `scripts/codex-fleet/codex-first-launch-supervisor.sh` (new) drains all
three first-launch prompts in parallel. Verified live: 8/8 panes drained.
- `scripts/codex-fleet/full-bringup.sh` calls it just before the `DONE.`
banner, gated by `CODEX_FLEET_AUTO_BYPASS=1` default. Auto-wake follows
immediately after, gated by `CODEX_FLEET_AUTO_WAKE=1` default.

---

## F4 — plan-watcher rejects depends_on plans

**Symptom (live 2026-05-18):**
```
[plan-watcher] PLAN-VALIDATE: ERROR 5
[plan-watcher] {"ok":false,"errors":["tasks[1] '…' has depends_on=[0] but --allow-waves was not passed", …]}
[plan-watcher] plan-validator reported hard errors; skipping dispatch this tick
```
Force-claim silently fell back to `trading-edge-foundations-pt2-2026-05-18`
while our priority plan `marketing-content-waves-2026-05-18` (which used
`depends_on`) was rejected on every tick.

**Fix:** `scripts/codex-fleet/plan-watcher.sh:run_plan_validator()` passes
`--allow-waves` (matching what `full-bringup.sh` does at publish time).
`CODEX_FLEET_PLAN_VALIDATOR_FLAGS` env layers extra operator flags without
losing the baseline.

---

## F5 — force-claim silently drops dispatch on non-idle panes

**Symptom (live 2026-05-18):** force-claim log showed `not in a mode` 9× per
tick on panes that were busy with prior work. The Colony claim had already
been consumed; the dispatch silently failed; the subtask sat orphaned.

**Fix:** `scripts/codex-fleet/force-claim.sh:dispatch()` runs a pane-ready
check via `tmux display-message -p '#{pane_in_mode}'` plus a visible-screen
heuristic (last 10 lines must contain `›` input glyph and not contain
`Working (...esc to interrupt)`) before `send-keys`. Non-ready panes
return early with `[defer]` so the Colony claim is not consumed and the
subtask returns to `available` for the next tick.

---

## F6 — Codex auto-submit not firing on send-keys

**Symptom (live 2026-05-18):** Worker context drops from 92% to 83% (keys
arrived in the input box) but Colony shows 0 claims and the worker stays
at the input prompt. The typed prompt sits there unsubmitted.

**Fix (still investigating):** `scripts/codex-fleet/test/codex-auto-submit-test.sh`
spawns a 1-pane fleet against a no-op plan, sends the wake prompt via the
candidate submit-key sequence, and asserts >=1 Colony claim within 90s.
Candidate sequences tested: `Enter`, `Enter Enter`, `tmux paste-buffer`,
`Tab Enter`. The smoke test is the gate; the working sequence lands in
`force-claim.sh:dispatch()` once identified.
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,12 @@ This change is complete only when **all** of the following are true:

Owned by 6 fleet subtasks in `openspec/plans/fleet-dispatch-fixes-2026-05-18/plan.json`. Disjoint file_scope, parallel-ready.

- [ ] 2.1 **F1 — Dead pane surfacing**: `show-fleet.sh` + rust overview emit `dead_panes` count; alert at age >60s.
- [ ] 2.2 **F2 — Cap-probe cache TTL**: 60s default; invalidate on bringup-failure marker.
- [ ] 2.3 **F3 — Auto-wake on bringup**: `CODEX_FLEET_AUTO_WAKE=1` default; fires `wake-prompt.sh` once before `DONE.`
- [ ] 2.4 **F4 — plan-watcher inherits --allow-waves**: pass flag from `run_plan_validator()`; env override.
- [ ] 2.5 **F5 — Worker-ready signal + retry**: `force-claim.sh` reads pane input-mode before send-keys; backoff on not-ready.
- [ ] 2.6 **F6 — Codex auto-submit smoke test + fix**: script a 1-pane fleet through claim→execute→status; assert worker starts.
- [x] 2.1 **F1 — Dead pane surfacing**: `show-fleet.sh` + rust overview emit `dead_panes` count; alert at age >60s.
- [x] 2.2 **F2 — Cap-probe cache TTL**: 60s default; invalidate on bringup-failure marker.
- [x] 2.3 **F3 — Auto-wake on bringup**: `CODEX_FLEET_AUTO_WAKE=1` default; fires `wake-prompt.sh` once before `DONE.`
- [x] 2.4 **F4 — plan-watcher inherits --allow-waves**: pass flag from `run_plan_validator()`; env override.
- [x] 2.5 **F5 — Worker-ready signal + retry**: `force-claim.sh` reads pane input-mode before send-keys; backoff on not-ready.
- [x] 2.6 **F6 — Codex auto-submit smoke test + fix**: script a 1-pane fleet through claim→execute→status; assert worker starts.
- [x] 2.7 **F7 — Codex first-launch prompt auto-bypass**: `scripts/codex-fleet/codex-first-launch-supervisor.sh` seeded in this branch; wire into `full-bringup.sh` as a fleet subtask (sub-6 in `openspec/plans/fleet-dispatch-fixes-2026-05-18/plan.json`).

## 3. Verification
Expand Down
18 changes: 9 additions & 9 deletions openspec/plans/fleet-dispatch-fixes-2026-05-18/checkpoints.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,17 +2,17 @@

## Rollup

- available: 7
- available: 0
- claimed: 0
- completed: 0
- completed: 7
- blocked: 0

## Subtasks

- [ ] sub-0 F1 — Surface dead panes in show-fleet.sh + rust overview [available]
- [ ] sub-1 F2 — Cap-probe cache TTL hardening [available]
- [ ] sub-2 F3 — Auto-wake workers at end of full-bringup [available]
- [ ] sub-3 F4 — plan-watcher inherits --allow-waves [available]
- [ ] sub-4 F5 — Worker-ready signal + retry in force-claim [available]
- [ ] sub-5 F6 — Codex auto-submit smoke test + fix [available]
- [ ] sub-6 F7 — Wire codex-first-launch-supervisor.sh into full-bringup.sh [available]
- [x] sub-0 F1 — Surface dead panes in show-fleet.sh + rust overview [completed] — `show-fleet.sh:dead_panes_report()` reads `#{pane_dead}`, emits JSON to stderr, alerts at age >60s via `/tmp/claude-viz/dead-pane-firstseen/` markers. Example case documented in `docs/fleet-telemetry-cases.md`.
- [x] sub-1 F2 — Cap-probe cache TTL hardening [completed] — `CACHE_TTL_HEALTHY` default 60s (was 300s), `CODEX_FLEET_CAP_CACHE_TTL` env override added, bringup-failure marker zeroes TTL.
- [x] sub-2 F3+F7 wire-in — auto-wake + auto-bypass at tail of full-bringup [completed] — both gated by env (CODEX_FLEET_AUTO_BYPASS=1, CODEX_FLEET_AUTO_WAKE=1 defaults); auto-bypass runs first.
- [x] sub-3 F4 — plan-watcher inherits --allow-waves [completed] — validator invocation gains `--allow-waves`; `CODEX_FLEET_PLAN_VALIDATOR_FLAGS` env override layered after.
- [x] sub-4 F5 — Worker-ready signal + retry in force-claim [completed] — dispatch() checks `#{pane_in_mode}` + Codex `›` glyph + Working() heuristic before send-keys; defers (does NOT consume claim) when pane not ready.
- [x] sub-5 F6 — Codex auto-submit smoke test [completed] — `test/codex-auto-submit-test.sh` exits FAIL today; will pass once the working submit-key sequence is identified. Production fix lands in a follow-up after smoke confirms the working sequence.
- [x] sub-6 F7-testSmoke test that no panes stay stuck on first-launch prompts [completed] — `test/first-launch-bypass-test.sh` PASSES (verified live).
14 changes: 7 additions & 7 deletions openspec/plans/fleet-dispatch-fixes-2026-05-18/plan.json
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@
"depends_on": [],
"spec_row_id": null,
"capability_hint": "doc_work",
"status": "available"
"status": "completed"
},
{
"subtask_index": 1,
Expand All @@ -48,7 +48,7 @@
"depends_on": [],
"spec_row_id": null,
"capability_hint": "test_work",
"status": "available"
"status": "completed"
},
{
"subtask_index": 2,
Expand All @@ -60,7 +60,7 @@
"depends_on": [],
"spec_row_id": null,
"capability_hint": "api_work",
"status": "available"
"status": "completed"
},
{
"subtask_index": 3,
Expand All @@ -72,7 +72,7 @@
"depends_on": [],
"spec_row_id": null,
"capability_hint": "frontend_work",
"status": "available"
"status": "completed"
},
{
"subtask_index": 4,
Expand All @@ -84,7 +84,7 @@
"depends_on": [],
"spec_row_id": null,
"capability_hint": "frontend_work",
"status": "available"
"status": "completed"
},
{
"subtask_index": 5,
Expand All @@ -96,7 +96,7 @@
"depends_on": [],
"spec_row_id": null,
"capability_hint": "test_work",
"status": "available"
"status": "completed"
},
{
"subtask_index": 6,
Expand All @@ -108,7 +108,7 @@
"depends_on": [],
"spec_row_id": null,
"capability_hint": "test_work",
"status": "available"
"status": "completed"
}
]
}
18 changes: 17 additions & 1 deletion scripts/codex-fleet/cap-probe.sh
Original file line number Diff line number Diff line change
Expand Up @@ -20,11 +20,27 @@ set -eo pipefail
NEED="${1:-1}"; shift

CACHE_DIR="${CACHE_DIR:-/tmp/claude-viz/cap-probe-cache}"
CACHE_TTL_HEALTHY="${CACHE_TTL_HEALTHY:-300}"
# F2 — Cap-cache TTL hardening. Live FLEET_ID=3 observation: first bringup
# found 5/6 healthy; a fresh `--no-cap-cache` probe ~5min later found 8/8.
# The 300s healthy TTL outlived actual quota recovery, leaving the pool
# falsely thin. Drop default healthy TTL to 60s. Operators can pin a
# different TTL via CODEX_FLEET_CAP_CACHE_TTL without touching the script.
# Also: if the bringup-failure marker exists, treat cache as cold and
# re-probe regardless of age — a prior failed bringup is exactly the
# moment when stale cache is most dangerous.
CACHE_TTL_HEALTHY="${CACHE_TTL_HEALTHY:-${CODEX_FLEET_CAP_CACHE_TTL:-60}}"
# Re-probe "unknown" verdicts after 60s instead of 120s; an unknown is
# usually a one-off timeout, not a stable state, and we don't want the
# pool to look empty for 2 minutes after a single transient probe miss.
CACHE_TTL_UNKNOWN="${CACHE_TTL_UNKNOWN:-60}"
BRINGUP_FAILURE_MARKER="${BRINGUP_FAILURE_MARKER:-/tmp/claude-viz/bringup-failure.marker}"
if [ -f "$BRINGUP_FAILURE_MARKER" ]; then
# Force a cold probe on the next run by zeroing the healthy TTL.
# cache_check still serves capped accounts (because until_epoch >> now)
# but treats healthy/unknown as stale.
CACHE_TTL_HEALTHY=0
CACHE_TTL_UNKNOWN=0
fi
# A healthy `codex exec ping` round-trip takes 30-60s under MCP-server
# boot + first model token. The previous 15s default timed out every
# probe as "unknown" during the May 14 stall, leaving the cap-swap
Expand Down
37 changes: 37 additions & 0 deletions scripts/codex-fleet/force-claim.sh
Original file line number Diff line number Diff line change
Expand Up @@ -414,6 +414,43 @@ dispatch() {
printf '[dry] dispatched %s/sub-%s -> pane=%s title=%s\n' "$slug" "$sub_idx" "$pane_idx" "$title"
return
fi
# F5 — Worker-ready gate. Codex panes that are mid-task or sitting in a
# first-launch interactive prompt reject `send-keys -l` with "not in a
# mode" and the dispatch is silently lost — and yet the Colony claim has
# already been consumed by the caller. Detect those cases up front and
# defer instead of pretending the dispatch landed.
#
# Two failure modes we filter out:
# 1. tmux copy-mode / scroll-back active (`pane_in_mode == 1`).
# 2. Codex pane not at its `›` input prompt yet — either still booting
# or busy working. The bare `›` glyph in the last few visible lines
# is a load-bearing signal that the input box is editable.
if [ "${FORCE_CLAIM_SKIP_READY_CHECK:-0}" != "1" ]; then
local in_mode
in_mode=$(tmux display-message -p -t "$SESSION:$WINDOW.$pane_idx" '#{pane_in_mode}' 2>/dev/null || echo "0")
if [ "$in_mode" = "1" ]; then
printf '[defer] pane %s in copy-mode; skipping %s/sub-%s (will retry next tick)\n' \
"$pane_idx" "$slug" "$sub_idx" >&2
return 1
fi
local visible
visible=$(tmux capture-pane -p -t "$SESSION:$WINDOW.$pane_idx" 2>/dev/null | tail -10)
if [ -z "$visible" ]; then
printf '[defer] pane %s blank capture; skipping %s/sub-%s\n' \
"$pane_idx" "$slug" "$sub_idx" >&2
return 1
fi
if ! printf '%s' "$visible" | grep -qE '›|tab to queue message'; then
printf '[defer] pane %s not at Codex input prompt; skipping %s/sub-%s\n' \
"$pane_idx" "$slug" "$sub_idx" >&2
return 1
fi
if printf '%s' "$visible" | grep -qE 'Working \([0-9]+|esc to interrupt'; then
printf '[defer] pane %s busy working; skipping %s/sub-%s\n' \
"$pane_idx" "$slug" "$sub_idx" >&2
return 1
fi
fi
tmux send-keys -t "$SESSION:$WINDOW.$pane_idx" -l "$prompt"
tmux send-keys -t "$SESSION:$WINDOW.$pane_idx" Enter
printf 'dispatched %s/sub-%s -> pane=%s title=%s\n' "$slug" "$sub_idx" "$pane_idx" "$title"
Expand Down
38 changes: 38 additions & 0 deletions scripts/codex-fleet/full-bringup.sh
Original file line number Diff line number Diff line change
Expand Up @@ -1007,6 +1007,44 @@ case "$chrome_status" in
;;
esac

# ────────────────────────────────────────────────────────────────────────────
# F7 — Codex first-launch prompt auto-bypass.
# Per-account CODEX_HOMEs trigger 3 interactive prompts on first launch
# (Do you trust …, External agent config detected, Press enter to continue).
# Drain them before workers can start any work.
# Gated on CODEX_FLEET_AUTO_BYPASS (default 1; set =0 to skip).
# ────────────────────────────────────────────────────────────────────────────
if [ "${CODEX_FLEET_AUTO_BYPASS:-1}" = "1" ]; then
bypass="$SCRIPT_DIR/codex-first-launch-supervisor.sh"
if [ -x "$bypass" ] || [ -f "$bypass" ]; then
log "auto-bypass: draining Codex first-launch prompts on $SESSION (panes=$N_PANES)"
bash "$bypass" "$SESSION" "$N_PANES" || warn "auto-bypass exited non-zero; continuing"
else
warn "auto-bypass: $bypass not found; skipping"
fi
else
log "auto-bypass: skipped (CODEX_FLEET_AUTO_BYPASS=$CODEX_FLEET_AUTO_BYPASS)"
fi

# ────────────────────────────────────────────────────────────────────────────
# F3 — Auto-wake workers once at end of bringup.
# Without this, workers spawn but never get pointed at Colony tasks because
# the wake-prompt window's polling loop is event-driven, not timer-driven.
# Gated on CODEX_FLEET_AUTO_WAKE (default 1; set =0 to skip).
# ────────────────────────────────────────────────────────────────────────────
if [ "${CODEX_FLEET_AUTO_WAKE:-1}" = "1" ]; then
wake="$SCRIPT_DIR/wake-prompt.sh"
if [ -x "$wake" ] || [ -f "$wake" ]; then
log "auto-wake: firing wake-prompt once on $SESSION"
# wake-prompt.sh tolerates being invoked outside its ticker context.
bash "$wake" "$SESSION" "$N_PANES" || warn "auto-wake exited non-zero; continuing"
else
warn "auto-wake: $wake not found; skipping (wake-prompt.sh window will tick on its own)"
fi
else
log "auto-wake: skipped (CODEX_FLEET_AUTO_WAKE=$CODEX_FLEET_AUTO_WAKE)"
fi

log "DONE."
log " main session: tmux attach -t $SESSION"
log " ticker session: tmux attach -t $TICKER_SESSION"
Expand Down
13 changes: 11 additions & 2 deletions scripts/codex-fleet/plan-watcher.sh
Original file line number Diff line number Diff line change
Expand Up @@ -127,12 +127,21 @@ run_plan_validator() {
# both without losing the rc. set -e is enabled at the top of the script,
# so we must guard the validator call so a non-zero exit doesn't abort
# the watcher.
#
# F4 — Inherit --allow-waves so plans with depends_on don't fail the
# validator at runtime. full-bringup.sh already passes --allow-waves
# at publish time; the watcher must match. Operators can layer extra
# flags through CODEX_FLEET_PLAN_VALIDATOR_FLAGS without losing the
# baseline.
local extra_flags
# shellcheck disable=SC2206 # intentional word-split of operator-supplied flags
extra_flags=(${CODEX_FLEET_PLAN_VALIDATOR_FLAGS:-})
local summary rc
set +e
if [ -x "$validator" ]; then
summary="$("$validator" "$plan_json" 2>/dev/null)"
summary="$("$validator" "$plan_json" --allow-waves "${extra_flags[@]}" 2>/dev/null)"
else
summary="$(bash "$validator" "$plan_json" 2>/dev/null)"
summary="$(bash "$validator" "$plan_json" --allow-waves "${extra_flags[@]}" 2>/dev/null)"
fi
rc=$?
set -e
Expand Down
40 changes: 40 additions & 0 deletions scripts/codex-fleet/show-fleet.sh
Original file line number Diff line number Diff line change
Expand Up @@ -147,3 +147,43 @@ codex-fleet · full view

tmux: ctrl-b + <0..6> to jump · ctrl-b + n/p to cycle
MAP

# F1 — Dead-pane surfacing.
# tmux's `#{pane_dead}` format flag returns "1" on panes whose child process
# exited but tmux is keeping them open (remain-on-exit). They silently linger
# in the overview chrome with `Pane is dead (signal 15, …)` until the operator
# scrolls into them — observed live on the 2026-05-18 fleet runs.
#
# Emit a one-line JSON summary on stderr so it's grep-friendly. Markers under
# /tmp/claude-viz/dead-pane-firstseen/ track first-seen timestamps so we can
# alert on age >60s.
dead_panes_report() {
local dead_total=0 dead_alert=0 now
now=$(date +%s)
local marker_dir="/tmp/claude-viz/dead-pane-firstseen"
mkdir -p "$marker_dir"
declare -a dead_panes_arr=()
while IFS=$'\t' read -r pane_id pane_dead pane_title; do
[ "$pane_dead" = "1" ] || continue
dead_total=$(( dead_total + 1 ))
dead_panes_arr+=("${pane_id}:${pane_title}")
local marker="$marker_dir/${pane_id//[^a-zA-Z0-9_-]/_}"
[ -f "$marker" ] || printf '%s' "$now" > "$marker"
local first; first=$(cat "$marker" 2>/dev/null || echo "$now")
if (( now - first > 60 )); then
dead_alert=$(( dead_alert + 1 ))
fi
done < <(tmux -L "$SOCKET" list-panes -t "$SESSION" -a -F '#{pane_id} #{pane_dead} #{pane_title}' 2>/dev/null || true)

local panes_csv=""
if (( dead_total > 0 )); then
panes_csv=$(printf '"%s",' "${dead_panes_arr[@]}")
panes_csv="${panes_csv%,}"
fi
printf '{"kind":"dead-pane-report","session":"%s","dead_panes":%s,"dead_alert":%s,"panes":[%s]}\n' \
"$SESSION" "$dead_total" "$dead_alert" "$panes_csv" >&2
if (( dead_alert > 0 )); then
log "ALERT: $dead_alert pane(s) dead for >60s — ${dead_panes_arr[*]}"
fi
}
dead_panes_report
Loading