feat: per-job event stream + watchdog + compact recovery (1.1.0)#335
feat: per-job event stream + watchdog + compact recovery (1.1.0)#335suminerProxy wants to merge 7 commits into
Conversation
Host plugin runtimes (e.g. Claude Code) inject CLAUDE_PLUGIN_DATA and
CODEX_COMPANION_SESSION_ID into the plugin process. When these leak into
`npm test`, two tests fail with non-bug assertions:
- tests/state.test.mjs: "resolveStateDir uses a temp-backed per-workspace
directory" — CLAUDE_PLUGIN_DATA redirects state to ~/.claude/plugins/data
instead of os.tmpdir().
- tests/runtime.test.mjs: "result without a job id ..." — fixture writer
and the spawned `result` subprocess look at different state dirs.
CI passes (no host-runtime env), so this only bites contributors running
`npm test` from inside a plugin host. Fix: delete these env vars at the top
of tests/helpers.mjs (the shared import) so every test gets a clean slate.
Tests that exercise the env var explicitly ("resolveStateDir uses
CLAUDE_PLUGIN_DATA when it is provided") already use try/finally to set and
restore it, so they continue to pass.
Verified: 86/86 tests pass with the polluting env vars still set in the
parent shell.
Add resolveJobEventsFile / appendJobEvent / readJobEvents for the upcoming
Claude-main-loop observability work (see DESIGN doc at user planning area).
Contract:
- Per-job events file at {stateDir}/jobs/{jobId}.events.ndjson.
- appendJobEvent caps each line at 4KB. POSIX `write(fd, buf, n)` with
O_APPEND is atomic when n < PIPE_BUF (~4096 on Linux/macOS), so single-
write append is safe across concurrent readers. Oversized lines first
truncate the `raw` field; if still too big, the whole event is replaced
with a minimal {type: "oversize-event-elided", seq, ts, method, phase}.
- readJobEvents returns [] for missing files. Partial last line (writer
mid-write before the trailing \n) is tolerated — JSON.parse failure on
the trailing line is skipped, next read picks it up. afterSeq takes
precedence over since when both are passed.
This commit is consumer-less; the producer side (captureTurn onNotification
hook) lands in the next commit so this slice can ship green independently.
Tests: 10 new unit tests covering path resolution, append+read, missing
file, afterSeq filter, since filter, afterSeq precedence over since, limit,
partial-line tolerance, raw truncation, and oversize elision. 96/96 pass.
Three changes to codex.mjs, building on the per-job NDJSON event API:
1. New exported `normalizeNotification(state, message)` function — pure
transform from app-server notifications to the flat event shape that
gets appended to {jobId}.events.ndjson. Covers thread/started,
thread/name/updated, turn/started, item/started, item/completed,
error, turn/completed, and an "unknown" fallback for forward-compat
methods (e.g. thread/compact/started). Phase inference reuses the
existing describeStartedItem / describeCompletedItem maps so the
vocabulary stays consistent with on-screen progress text.
2. `captureTurn` accepts an `options.onNotification` callback. A small
`dispatch(message)` wrapper applies state mutation first, then emits
the normalized event. Order is load-bearing: normalize reads
state.threadTurnIds populated by applyTurnNotification. onNotification
errors are swallowed so a broken consumer can never crash the worker.
3. `runAppServerTurn` forwards `options.onNotification` to captureTurn
and surfaces `usage` (from turnState.finalTurn?.usage) as a top-level
field on the result, so /codex:status and the events stream can read
token usage without traversing the nested `turn` payload.
Tests: 13 new unit tests covering each method branch, phase inference
edge cases (commandExecution -> running vs verifying), unknown-method
fallback, and malformed-input resilience. 109/109 pass.
No behavioral change for callers that don't pass onNotification — the
hook is fully opt-in. The companion-side wiring lands separately.
End-to-end wiring of the observability path landed in the prior two commits
(state.mjs events API + codex.mjs notification hook).
In codex-companion.mjs:
- executeTaskRun: forward `onNotification` through to runAppServerTurn.
Sync foreground callers still don't pass one; only the background
task-worker path opts in.
- handleTaskWorker: build a same-process closure that
(a) appends every normalized notification to {jobId}.events.ndjson
with a monotonic per-job seq number;
(b) reflects phase transitions into state.json via upsertJob (only on
real phase change, to keep the unlocked single-flight writer rare);
(c) runs a 5s stall watchdog — when the gap since the last event
exceeds CODEX_COMPANION_STALL_SECONDS (default 60s), it emits one
{type:"watchdog", phase:"stuck"} record and flips job phase to
stuck. It does NOT cancel. Main-loop Claude decides what to do
next (continue / compact / cancel).
(d) on terminal exit (success OR failure), emits a single
{type:"job/exited", phase:"completed"|"failed", exitCode, errorMessage}
record so a polling reader can distinguish "still slow" from
"already finished". The job.status state.json field is no longer
the sole source of truth — main-loop Claude must look for the
job/exited event.
Observability errors are swallowed; they must never crash the worker.
Watchdog interval is unref'd so it can't keep the event loop alive.
- New `events <job-id> [--since|--after-seq|--limit] [--json]` subcommand
delegating to readJobEvents. Default output is human-readable lines;
--json returns {jobId, eventsFile, count, events}. After-seq + since
semantics match readJobEvents (after-seq takes precedence). printUsage
updated.
Tests: 7 integration tests spawning the real CLI to verify the events
command's empty-result, append-and-read, after-seq filtering, limit
capping, human-readable rendering, and missing-jobId usage paths.
116/116 pass.
The stall threshold is configurable via env var for now
(CODEX_COMPANION_STALL_SECONDS). A `task --max-stall-seconds <N>` CLI
flag is deferred to a later commit so handleTask's option parser stays
untouched in this slice.
E2E against real codex CLI 0.131.0-alpha.9 uncovered three issues mock
tests could not surface. Fixed all three; rerun confirms zero `unknown`
phase events in a successful turn.
1. handleTaskWorker job/exited bug (correctness)
runTrackedJob does NOT throw when codex returns a failed turn — it
resolves with execution.exitStatus != 0 and writes state.status="failed"
out-of-band. Previous code keyed `completed` off the absence of an
exception, so a failed turn was misreported as `phase:"completed",
exitCode:0` in the terminal job/exited event. This is the single
record main-loop Claude reads to decide "did this work?" — getting
it wrong led to silent false positives. Fix: inspect
execution.exitStatus directly. Bug repro: dispatch with --effort
minimal (which conflicts with web_search) — codex returns an
invalid_request_error, runTrackedJob resolves with exitStatus=1,
stored state.json shows status="failed", but the prior code emitted
{phase:"completed", exitCode:0}. After fix: {phase:"failed",
exitCode:1, errorMessage:"Task did not complete successfully ..."}.
2. normalize coverage gaps (forward-compat / readability)
codex 0.131 emits three notification methods + three item types the
prior switch did not recognize, leaving them as phase:"unknown" with
generic fallback messages:
- method `thread/status/changed` with status.type in {"idle"} →
phase:"idle", message "Thread idle.". Post-turn quiescence signal.
- method `thread/tokenUsage/updated` → phase:"metering" (new word in
the phase vocab). Streaming token-usage source. This is the real
event main-loop Claude should poll to detect context-budget pressure
before turn/completed surfaces final usage. NOTE: codex 0.131 payload
shape for the usage object is not documented; normalize tries
{inputTokens,outputTokens,cachedInputTokens} and several aliases;
when no recognized keys are found, falls back to a stable label and
preserves raw. Concrete schema discovery is a Phase 3 follow-up.
- item.type `agentMessage` / `assistantMessage` / `reasoning` →
phase:"thinking". `agentMessage` is codex's final-reply item;
describeStartedItem/describeCompletedItem now use a new
extractItemText helper to surface a content preview ("Codex
replied: pong") so main-loop Claude can recognize the answer
from the event stream without fetching /codex:result.
3. Single E2E rerun verification
Same prompt with --effort low (avoids the minimal+web_search
constraint): 13 events emitted, phase distribution {thinking:8,
completed:2, idle:1, metering:1, warning:1, unknown:0}. The prior
normalize generated 5 unknown out of 9 events (56%); this commit
brings it to 0/13 on the happy path.
Tests: +6 new normalize unit tests (status idle, tokenUsage with usage,
tokenUsage without usage, agentMessage with text, agentMessage with
content[].text, schema fallback). 128/128 pass total. The job/exited
fix is exercised in the existing fake-codex fixture indirectly; a
direct unit test would need a fake runTrackedJob that simulates
"resolved with non-zero exitStatus" which is non-trivial to mock — for
now the E2E repro and the inline reasoning in the comment are the
documented evidence.
Closes the "operate codex like a Claude subagent" half of the design
contract. Phase 1+2 gave the main loop an event stream; Phase 3 gives
it the slash-command surface and the protocol-native recovery path.
In lib/codex.mjs:
- New exported compactAppServerThread(cwd, {threadId}). Wraps codex
app-server's thread/compact/start RPC. This is the protocol-native
recovery for "prompt too long" — main-loop Claude calls it after a
turn fails with context-overflow, then resumes via /codex:rescue
--resume <amended prompt>. Fire-and-return: the call awaits the
app-server's ack but does not consume the streaming response. The
broker recognizes thread/compact/start as STREAMING_METHOD but
routes the stream to whoever owns it at that moment; compaction
itself completes on the codex side regardless of consumer presence.
Uses reuseExistingBroker: true so it can punch through to an
already-running broker if one exists. The exact success payload
shape is undocumented in codex CLI 0.131; result is preserved
verbatim under .result for forward-compat.
In codex-companion.mjs:
- New `compact <thread-id> [--json]` subcommand wrapping
compactAppServerThread. Plain output prints the operation result and
hints the resume flow; --json returns the full structured report.
Smoke-tested against real codex 0.131: bogus thread id correctly
returns attempted:true, compacted:false, with codex's own error
("invalid thread id") preserved under .detail.
In agents/codex-rescue.md + commands/rescue.md:
- Default execution mode flipped from foreground to background. The
prior heuristic ("small bounded => foreground; complex => background")
was the deadlock root cause: a small task that stalls is still a
deadlock, and "small" is unknowable in advance. Background is the
safe default — the main loop polls /codex:events <job-id> for
progress instead of blocking on the synchronous Bash call. --wait is
honored when the user explicitly asks for foreground.
In commands/:
- New events.md: slash command surfacing the per-job event stream.
Documents the {type:"job/exited"} terminal-state contract, the
phase:"stuck" watchdog signal, the phase:"metering" token-usage
source, and the --after-seq incremental polling pattern.
- New compact.md: slash command for the recovery sequence. Documents
the typical "cancel → compact → resume with amended prompt" idiom
main-loop Claude should follow when codex hits context overflow.
In tests/commands.test.mjs:
- Updated assertions to reflect the new background default and to
include events.md + compact.md in the commands/ file-list invariant.
The prior assertions hard-coded the prose "default to foreground" and
the file list of 7 commands; both needed to track this change.
Tests: 128/128 pass. No new tests added — the compact path is exercised
indirectly via a smoke check (bogus thread id returns the expected
structured error against the real app-server). A full E2E for compact
needs a real turn first to obtain a valid thread id; that lands in
Phase 4 alongside version bump + CHANGELOG.
Cuts the minor-version release for the observability rework (feat/event-stream-foundation). Adds /codex:events, /codex:compact, rescue defaults to --background, stall watchdog, terminal job/exited events, top-level token usage, and broader codex CLI 0.131 notification coverage. No breaking changes — all existing commands keep the same signatures and outputs; the new event stream is additive.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f9f36fb016
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (options.json) { | ||
| console.log(JSON.stringify(report, null, 2)); | ||
| return; |
There was a problem hiding this comment.
Preserve failure exit status for JSON compaction
When /codex:compact --json is used and the app-server rejects the compaction request (for example, a malformed or stale thread id), this branch prints the failed {attempted:true, compacted:false} report and returns with exit code 0. The non-JSON path below correctly sets process.exitCode = 1, and the command documentation promises non-zero on rejection; with --json, scripts or agents can treat failed compaction as success and proceed to resume an uncompacted thread.
Useful? React with 👍 / 👎.
| - If the request includes `--background`, run the `codex:codex-rescue` subagent in the background. | ||
| - If the request includes `--wait`, run the `codex:codex-rescue` subagent in the foreground. | ||
| - If neither flag is present, default to foreground. | ||
| - If neither flag is present, default to background. The companion returns a job id immediately; poll progress with `/codex:events <job-id> --since <last-seq> --json` (or `/codex:status <job-id>`) instead of blocking on a synchronous Bash call. This prevents the main Claude loop from deadlocking when Codex stalls or fails silently. |
There was a problem hiding this comment.
Use the sequence polling flag in rescue guidance
This default-background guidance tells the main loop to poll with --since <last-seq>, but the events command treats --since as an ISO timestamp and uses --after-seq for sequence numbers. If a model follows this instruction with a numeric last sequence, ISO timestamps compare greater than that string and the poll returns old events repeatedly, defeating incremental monitoring; the example should use --after-seq <last-seq>.
Useful? React with 👍 / 👎.
|
Closing — switching to a fork-internal PR (suminerProxy/codex-plugin-cc) so the owner can self-merge. Re-opening if upstream eventually wants this contribution. |
|
Was the watchdog / compact-recovery / event-stream work here folded into another effort — PR #343 looks adjacent on broker cleanup, or is the watchdog path being addressed elsewhere? Tracking the broker-leak issue and want to make sure I'm following the active thread. |
Why
When Claude Code's main loop delegates a coding task to Codex via
/codex:rescue, the subagent makes a single synchronousBashcall intothe companion
taskcommand and waits for it to return. If Codex stalls(prompt too long, model hang, tool deadlock, API rate limit), the main
loop stalls too — there is no out-of-band signal it can poll to detect
the freeze, and
/codex:statusonly exposes coarse job-level phase.This PR adds the missing channel: a per-job NDJSON event stream the main
Claude loop can poll with
/codex:events, plus a stall watchdog thatemits a
phase:"stuck"event when codex goes quiet, and aprotocol-native
/codex:compactrecovery path for context overflow./codex:rescuenow defaults to--backgroundso the main loop neverblocks waiting on the synchronous Bash call.
What ships in 1.1.0
/codex:events <job-id>— incremental polling of normalized codexnotifications. Supports
--since/--after-seq/--limit/--json./codex:compact <thread-id>— wrapsthread/compact/startfor thecancel → compact → resume-with-amended-prompt flow.
/codex:rescuedefault flipped to--background. The main loop polls/codex:eventsinstead of blocking.CODEX_COMPANION_STALL_SECONDS)emits
{type:"watchdog", phase:"stuck"}. Does not cancel.{type:"job/exited"}terminal event — single source of truth forend-of-job. Fixes a correctness bug where codex turn failures were
reported as
phase:"completed"becauserunTrackedJobresolves withexitStatus != 0rather than throwing.runAppServerTurnsurfacesusageat top level; real-time updatesflow via
thread/tokenUsage/updatedevents (phase:"metering").thread/status/changed,warning,thread/tokenUsage/updated, plus item typesuserMessage,assistantMessage/agentMessage,reasoning. E2E against realcodex 0.131 shows
unknownphase count goes from 5/9 (56%) to 0/13.Architecture
The event capture point is
captureTurn's notification handler, NOT thebroker. The broker is a single-flight, single-stream-owner router
(
app-server-broker.mjs:84-90) — it cannot do per-job fan-out without arewrite. Instead,
captureTurnaccepts anonNotificationcallback thatruns after
applyTurnNotification, normalizes the message, and thetask-worker appends to
{stateDir}/jobs/{jobId}.events.ndjson. Thewatchdog
setIntervalshares a closure with the same handler in theworker process so cross-process state synchronization is unneeded.
Single-job-at-a-time is preserved (matches broker reality). Concurrent
multi-job and broker fan-out are deferred to a future PR.
Test plan
npm test: 128 / 128 pass. Adds 42 new tests (10 events API, 25normalize, 7 events command integration).
CLAUDE_PLUGIN_DATAandCODEX_COMPANION_SESSION_IDintests/helpers.mjssonpm testworks inside a plugin hostruntime. This branch includes that commit.
/codex:review,/codex:adversarial-review,/codex:result,/codex:cancel,/codex:statusall keep the samesignatures and output.
Verification status
Verified end-to-end against codex CLI 0.131.0-alpha.9:
job/exited phase:"completed" exitCode:0, 0 unknown-phase events. Token usage event + idle event +reply preview ("Codex replied: pong") all surface correctly.
--effort minimal "..."triggerscodex's
invalid_request_error(web_search incompatible withminimal effort). Result:
job/exited phase:"failed" exitCode:1 errorMessage:"Task did not complete successfully (see prior events for details)."Storedstate.jsonshowsstatus:"failed" phase:"failed"— the verdict matches what runs through the eventsstream. This is the bug fix in
3a0ae2d: before this PR,job/exitedreported
phase:"completed" exitCode:0because the worker keyed offthrown exceptions rather than
execution.exitStatus.CODEX_COMPANION_STALL_SECONDS=5+--effort xhighreasoning task. After 8.54s of silence inside a Reasoning item(longer than the 5s threshold), the worker emitted
{type:"watchdog", phase:"stuck", stallMs:8540, since:"<ts of last event>"}. Codex resumed naturally a moment later; the watchdog didNOT cancel the job and did NOT re-fire spuriously when subsequent
short quiet periods (~2s) occurred. The
sincetimestamp alignsexactly with the previous notification's
ts./codex:compact <bogus-id>returns
attempted:true, compacted:false, transport:"direct", detail:"invalid thread id: ..."— the codex-side error is preserved.Open schema unknowns (does not block this PR)
thread/compact/startsuccess payload shape is undocumented forcodex CLI 0.131. The wrapper preserves it verbatim under
.resultfor forward-compat. The rejection path is structured; the success
path shape is the open question for the next docs revision.
thread/tokenUsage/updatedpayload field names not confirmed againstcodex CLI 0.131 docs. The normalize function tries
inputTokens/outputTokens/cachedInputTokensand a few aliases, falls back toa stable "Token usage updated." label. The event itself still fires
correctly (
phase:"metering"); only the human-readable token countin
.messagerequires knowing the exact field names.Both are tracked at the relevant code paths.
Build status
CI on this branch will need to handle the existing upstream
incompatibility with codex CLI 0.131 (
requestAttestation/experimentalRawEventsschema drift inplugins/codex/scripts/lib/app-server.mjsandlib/codex.mjs:64).That is a pre-existing condition on
mainand is intentionally notaddressed in this PR to keep the diff focused on observability.