Skip to content

feat: per-job event stream + watchdog + compact recovery (1.1.0)#335

Closed
suminerProxy wants to merge 7 commits into
openai:mainfrom
suminerProxy:feat/event-stream-foundation
Closed

feat: per-job event stream + watchdog + compact recovery (1.1.0)#335
suminerProxy wants to merge 7 commits into
openai:mainfrom
suminerProxy:feat/event-stream-foundation

Conversation

@suminerProxy
Copy link
Copy Markdown

@suminerProxy suminerProxy commented May 19, 2026

Why

When Claude Code's main loop delegates a coding task to Codex via
/codex:rescue, the subagent makes a single synchronous Bash call into
the companion task command and waits for it to return. If Codex stalls
(prompt too long, model hang, tool deadlock, API rate limit), the main
loop stalls too — there is no out-of-band signal it can poll to detect
the freeze, and /codex:status only exposes coarse job-level phase.

This PR adds the missing channel: a per-job NDJSON event stream the main
Claude loop can poll with /codex:events, plus a stall watchdog that
emits a phase:"stuck" event when codex goes quiet, and a
protocol-native /codex:compact recovery path for context overflow.
/codex:rescue now defaults to --background so the main loop never
blocks waiting on the synchronous Bash call.

What ships in 1.1.0

  • /codex:events <job-id> — incremental polling of normalized codex
    notifications. Supports --since / --after-seq / --limit / --json.
  • /codex:compact <thread-id> — wraps thread/compact/start for the
    cancel → compact → resume-with-amended-prompt flow.
  • /codex:rescue default flipped to --background. The main loop polls
    /codex:events instead of blocking.
  • 60s per-job stall watchdog (override via CODEX_COMPANION_STALL_SECONDS)
    emits {type:"watchdog", phase:"stuck"}. Does not cancel.
  • New {type:"job/exited"} terminal event — single source of truth for
    end-of-job. Fixes a correctness bug where codex turn failures were
    reported as phase:"completed" because runTrackedJob resolves with
    exitStatus != 0 rather than throwing.
  • runAppServerTurn surfaces usage at top level; real-time updates
    flow via thread/tokenUsage/updated events (phase:"metering").
  • codex CLI 0.131 notification coverage extended: thread/status/changed,
    warning, thread/tokenUsage/updated, plus item types userMessage,
    assistantMessage / agentMessage, reasoning. E2E against real
    codex 0.131 shows unknown phase count goes from 5/9 (56%) to 0/13.

Architecture

The event capture point is captureTurn's notification handler, NOT the
broker. The broker is a single-flight, single-stream-owner router
(app-server-broker.mjs:84-90) — it cannot do per-job fan-out without a
rewrite. Instead, captureTurn accepts an onNotification callback that
runs after applyTurnNotification, normalizes the message, and the
task-worker appends to {stateDir}/jobs/{jobId}.events.ndjson. The
watchdog setInterval shares a closure with the same handler in the
worker process so cross-process state synchronization is unneeded.

Single-job-at-a-time is preserved (matches broker reality). Concurrent
multi-job and broker fan-out are deferred to a future PR.

Test plan

  • npm test: 128 / 128 pass. Adds 42 new tests (10 events API, 25
    normalize, 7 events command integration).
  • Test isolation: separate prior PR (test: isolate from host-runtime plugin env vars #334 if accepted) unsets
    CLAUDE_PLUGIN_DATA and CODEX_COMPANION_SESSION_ID in
    tests/helpers.mjs so npm test works inside a plugin host
    runtime. This branch includes that commit.
  • No breaking changes — /codex:review, /codex:adversarial-review,
    /codex:result, /codex:cancel, /codex:status all keep the same
    signatures and output.

Verification status

Verified end-to-end against codex CLI 0.131.0-alpha.9:

  • Success path: 13 events, job/exited phase:"completed" exitCode:0, 0 unknown-phase events. Token usage event + idle event +
    reply preview ("Codex replied: pong") all surface correctly.
  • Failed path: dispatch with --effort minimal "..." triggers
    codex's invalid_request_error (web_search incompatible with
    minimal effort). Result: job/exited phase:"failed" exitCode:1 errorMessage:"Task did not complete successfully (see prior events for details)." Stored state.json shows status:"failed" phase:"failed" — the verdict matches what runs through the events
    stream. This is the bug fix in 3a0ae2d: before this PR, job/exited
    reported phase:"completed" exitCode:0 because the worker keyed off
    thrown exceptions rather than execution.exitStatus.
  • Stall watchdog: CODEX_COMPANION_STALL_SECONDS=5 + --effort xhigh reasoning task. After 8.54s of silence inside a Reasoning item
    (longer than the 5s threshold), the worker emitted
    {type:"watchdog", phase:"stuck", stallMs:8540, since:"<ts of last event>"}. Codex resumed naturally a moment later; the watchdog did
    NOT cancel the job and did NOT re-fire spuriously when subsequent
    short quiet periods (~2s) occurred. The since timestamp aligns
    exactly with the previous notification's ts.
  • Compact rejection path: smoke test /codex:compact <bogus-id>
    returns attempted:true, compacted:false, transport:"direct", detail:"invalid thread id: ..." — the codex-side error is preserved.

Open schema unknowns (does not block this PR)

  • thread/compact/start success payload shape is undocumented for
    codex CLI 0.131. The wrapper preserves it verbatim under .result
    for forward-compat. The rejection path is structured; the success
    path shape is the open question for the next docs revision.
  • thread/tokenUsage/updated payload field names not confirmed against
    codex CLI 0.131 docs. The normalize function tries inputTokens /
    outputTokens / cachedInputTokens and a few aliases, falls back to
    a stable "Token usage updated." label. The event itself still fires
    correctly (phase:"metering"); only the human-readable token count
    in .message requires knowing the exact field names.

Both are tracked at the relevant code paths.

Build status

CI on this branch will need to handle the existing upstream
incompatibility with codex CLI 0.131 (requestAttestation /
experimentalRawEvents schema drift in
plugins/codex/scripts/lib/app-server.mjs and lib/codex.mjs:64).
That is a pre-existing condition on main and is intentionally not
addressed in this PR to keep the diff focused on observability.

bit-star added 7 commits May 19, 2026 15:59
Host plugin runtimes (e.g. Claude Code) inject CLAUDE_PLUGIN_DATA and
CODEX_COMPANION_SESSION_ID into the plugin process. When these leak into
`npm test`, two tests fail with non-bug assertions:

- tests/state.test.mjs: "resolveStateDir uses a temp-backed per-workspace
  directory" — CLAUDE_PLUGIN_DATA redirects state to ~/.claude/plugins/data
  instead of os.tmpdir().
- tests/runtime.test.mjs: "result without a job id ..." — fixture writer
  and the spawned `result` subprocess look at different state dirs.

CI passes (no host-runtime env), so this only bites contributors running
`npm test` from inside a plugin host. Fix: delete these env vars at the top
of tests/helpers.mjs (the shared import) so every test gets a clean slate.
Tests that exercise the env var explicitly ("resolveStateDir uses
CLAUDE_PLUGIN_DATA when it is provided") already use try/finally to set and
restore it, so they continue to pass.

Verified: 86/86 tests pass with the polluting env vars still set in the
parent shell.
Add resolveJobEventsFile / appendJobEvent / readJobEvents for the upcoming
Claude-main-loop observability work (see DESIGN doc at user planning area).

Contract:
- Per-job events file at {stateDir}/jobs/{jobId}.events.ndjson.
- appendJobEvent caps each line at 4KB. POSIX `write(fd, buf, n)` with
  O_APPEND is atomic when n < PIPE_BUF (~4096 on Linux/macOS), so single-
  write append is safe across concurrent readers. Oversized lines first
  truncate the `raw` field; if still too big, the whole event is replaced
  with a minimal {type: "oversize-event-elided", seq, ts, method, phase}.
- readJobEvents returns [] for missing files. Partial last line (writer
  mid-write before the trailing \n) is tolerated — JSON.parse failure on
  the trailing line is skipped, next read picks it up. afterSeq takes
  precedence over since when both are passed.

This commit is consumer-less; the producer side (captureTurn onNotification
hook) lands in the next commit so this slice can ship green independently.

Tests: 10 new unit tests covering path resolution, append+read, missing
file, afterSeq filter, since filter, afterSeq precedence over since, limit,
partial-line tolerance, raw truncation, and oversize elision. 96/96 pass.
Three changes to codex.mjs, building on the per-job NDJSON event API:

1. New exported `normalizeNotification(state, message)` function — pure
   transform from app-server notifications to the flat event shape that
   gets appended to {jobId}.events.ndjson. Covers thread/started,
   thread/name/updated, turn/started, item/started, item/completed,
   error, turn/completed, and an "unknown" fallback for forward-compat
   methods (e.g. thread/compact/started). Phase inference reuses the
   existing describeStartedItem / describeCompletedItem maps so the
   vocabulary stays consistent with on-screen progress text.

2. `captureTurn` accepts an `options.onNotification` callback. A small
   `dispatch(message)` wrapper applies state mutation first, then emits
   the normalized event. Order is load-bearing: normalize reads
   state.threadTurnIds populated by applyTurnNotification. onNotification
   errors are swallowed so a broken consumer can never crash the worker.

3. `runAppServerTurn` forwards `options.onNotification` to captureTurn
   and surfaces `usage` (from turnState.finalTurn?.usage) as a top-level
   field on the result, so /codex:status and the events stream can read
   token usage without traversing the nested `turn` payload.

Tests: 13 new unit tests covering each method branch, phase inference
edge cases (commandExecution -> running vs verifying), unknown-method
fallback, and malformed-input resilience. 109/109 pass.

No behavioral change for callers that don't pass onNotification — the
hook is fully opt-in. The companion-side wiring lands separately.
End-to-end wiring of the observability path landed in the prior two commits
(state.mjs events API + codex.mjs notification hook).

In codex-companion.mjs:

- executeTaskRun: forward `onNotification` through to runAppServerTurn.
  Sync foreground callers still don't pass one; only the background
  task-worker path opts in.

- handleTaskWorker: build a same-process closure that
  (a) appends every normalized notification to {jobId}.events.ndjson
      with a monotonic per-job seq number;
  (b) reflects phase transitions into state.json via upsertJob (only on
      real phase change, to keep the unlocked single-flight writer rare);
  (c) runs a 5s stall watchdog — when the gap since the last event
      exceeds CODEX_COMPANION_STALL_SECONDS (default 60s), it emits one
      {type:"watchdog", phase:"stuck"} record and flips job phase to
      stuck. It does NOT cancel. Main-loop Claude decides what to do
      next (continue / compact / cancel).
  (d) on terminal exit (success OR failure), emits a single
      {type:"job/exited", phase:"completed"|"failed", exitCode, errorMessage}
      record so a polling reader can distinguish "still slow" from
      "already finished". The job.status state.json field is no longer
      the sole source of truth — main-loop Claude must look for the
      job/exited event.
  Observability errors are swallowed; they must never crash the worker.
  Watchdog interval is unref'd so it can't keep the event loop alive.

- New `events <job-id> [--since|--after-seq|--limit] [--json]` subcommand
  delegating to readJobEvents. Default output is human-readable lines;
  --json returns {jobId, eventsFile, count, events}. After-seq + since
  semantics match readJobEvents (after-seq takes precedence). printUsage
  updated.

Tests: 7 integration tests spawning the real CLI to verify the events
command's empty-result, append-and-read, after-seq filtering, limit
capping, human-readable rendering, and missing-jobId usage paths.
116/116 pass.

The stall threshold is configurable via env var for now
(CODEX_COMPANION_STALL_SECONDS). A `task --max-stall-seconds <N>` CLI
flag is deferred to a later commit so handleTask's option parser stays
untouched in this slice.
E2E against real codex CLI 0.131.0-alpha.9 uncovered three issues mock
tests could not surface. Fixed all three; rerun confirms zero `unknown`
phase events in a successful turn.

1. handleTaskWorker job/exited bug (correctness)
   runTrackedJob does NOT throw when codex returns a failed turn — it
   resolves with execution.exitStatus != 0 and writes state.status="failed"
   out-of-band. Previous code keyed `completed` off the absence of an
   exception, so a failed turn was misreported as `phase:"completed",
   exitCode:0` in the terminal job/exited event. This is the single
   record main-loop Claude reads to decide "did this work?" — getting
   it wrong led to silent false positives. Fix: inspect
   execution.exitStatus directly. Bug repro: dispatch with --effort
   minimal (which conflicts with web_search) — codex returns an
   invalid_request_error, runTrackedJob resolves with exitStatus=1,
   stored state.json shows status="failed", but the prior code emitted
   {phase:"completed", exitCode:0}. After fix: {phase:"failed",
   exitCode:1, errorMessage:"Task did not complete successfully ..."}.

2. normalize coverage gaps (forward-compat / readability)
   codex 0.131 emits three notification methods + three item types the
   prior switch did not recognize, leaving them as phase:"unknown" with
   generic fallback messages:

   - method `thread/status/changed` with status.type in {"idle"} →
     phase:"idle", message "Thread idle.". Post-turn quiescence signal.
   - method `thread/tokenUsage/updated` → phase:"metering" (new word in
     the phase vocab). Streaming token-usage source. This is the real
     event main-loop Claude should poll to detect context-budget pressure
     before turn/completed surfaces final usage. NOTE: codex 0.131 payload
     shape for the usage object is not documented; normalize tries
     {inputTokens,outputTokens,cachedInputTokens} and several aliases;
     when no recognized keys are found, falls back to a stable label and
     preserves raw. Concrete schema discovery is a Phase 3 follow-up.
   - item.type `agentMessage` / `assistantMessage` / `reasoning` →
     phase:"thinking". `agentMessage` is codex's final-reply item;
     describeStartedItem/describeCompletedItem now use a new
     extractItemText helper to surface a content preview ("Codex
     replied: pong") so main-loop Claude can recognize the answer
     from the event stream without fetching /codex:result.

3. Single E2E rerun verification
   Same prompt with --effort low (avoids the minimal+web_search
   constraint): 13 events emitted, phase distribution {thinking:8,
   completed:2, idle:1, metering:1, warning:1, unknown:0}. The prior
   normalize generated 5 unknown out of 9 events (56%); this commit
   brings it to 0/13 on the happy path.

Tests: +6 new normalize unit tests (status idle, tokenUsage with usage,
tokenUsage without usage, agentMessage with text, agentMessage with
content[].text, schema fallback). 128/128 pass total. The job/exited
fix is exercised in the existing fake-codex fixture indirectly; a
direct unit test would need a fake runTrackedJob that simulates
"resolved with non-zero exitStatus" which is non-trivial to mock — for
now the E2E repro and the inline reasoning in the comment are the
documented evidence.
Closes the "operate codex like a Claude subagent" half of the design
contract. Phase 1+2 gave the main loop an event stream; Phase 3 gives
it the slash-command surface and the protocol-native recovery path.

In lib/codex.mjs:

- New exported compactAppServerThread(cwd, {threadId}). Wraps codex
  app-server's thread/compact/start RPC. This is the protocol-native
  recovery for "prompt too long" — main-loop Claude calls it after a
  turn fails with context-overflow, then resumes via /codex:rescue
  --resume <amended prompt>. Fire-and-return: the call awaits the
  app-server's ack but does not consume the streaming response. The
  broker recognizes thread/compact/start as STREAMING_METHOD but
  routes the stream to whoever owns it at that moment; compaction
  itself completes on the codex side regardless of consumer presence.
  Uses reuseExistingBroker: true so it can punch through to an
  already-running broker if one exists. The exact success payload
  shape is undocumented in codex CLI 0.131; result is preserved
  verbatim under .result for forward-compat.

In codex-companion.mjs:

- New `compact <thread-id> [--json]` subcommand wrapping
  compactAppServerThread. Plain output prints the operation result and
  hints the resume flow; --json returns the full structured report.
  Smoke-tested against real codex 0.131: bogus thread id correctly
  returns attempted:true, compacted:false, with codex's own error
  ("invalid thread id") preserved under .detail.

In agents/codex-rescue.md + commands/rescue.md:

- Default execution mode flipped from foreground to background. The
  prior heuristic ("small bounded => foreground; complex => background")
  was the deadlock root cause: a small task that stalls is still a
  deadlock, and "small" is unknowable in advance. Background is the
  safe default — the main loop polls /codex:events <job-id> for
  progress instead of blocking on the synchronous Bash call. --wait is
  honored when the user explicitly asks for foreground.

In commands/:

- New events.md: slash command surfacing the per-job event stream.
  Documents the {type:"job/exited"} terminal-state contract, the
  phase:"stuck" watchdog signal, the phase:"metering" token-usage
  source, and the --after-seq incremental polling pattern.
- New compact.md: slash command for the recovery sequence. Documents
  the typical "cancel → compact → resume with amended prompt" idiom
  main-loop Claude should follow when codex hits context overflow.

In tests/commands.test.mjs:

- Updated assertions to reflect the new background default and to
  include events.md + compact.md in the commands/ file-list invariant.
  The prior assertions hard-coded the prose "default to foreground" and
  the file list of 7 commands; both needed to track this change.

Tests: 128/128 pass. No new tests added — the compact path is exercised
indirectly via a smoke check (bogus thread id returns the expected
structured error against the real app-server). A full E2E for compact
needs a real turn first to obtain a valid thread id; that lands in
Phase 4 alongside version bump + CHANGELOG.
Cuts the minor-version release for the observability rework
(feat/event-stream-foundation). Adds /codex:events, /codex:compact,
rescue defaults to --background, stall watchdog, terminal job/exited
events, top-level token usage, and broader codex CLI 0.131 notification
coverage. No breaking changes — all existing commands keep the same
signatures and outputs; the new event stream is additive.
@suminerProxy suminerProxy requested a review from a team May 19, 2026 09:47
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f9f36fb016

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +952 to +954
if (options.json) {
console.log(JSON.stringify(report, null, 2));
return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve failure exit status for JSON compaction

When /codex:compact --json is used and the app-server rejects the compaction request (for example, a malformed or stale thread id), this branch prints the failed {attempted:true, compacted:false} report and returns with exit code 0. The non-JSON path below correctly sets process.exitCode = 1, and the command documentation promises non-zero on rejection; with --json, scripts or agents can treat failed compaction as success and proceed to resume an uncompacted thread.

Useful? React with 👍 / 👎.

- If the request includes `--background`, run the `codex:codex-rescue` subagent in the background.
- If the request includes `--wait`, run the `codex:codex-rescue` subagent in the foreground.
- If neither flag is present, default to foreground.
- If neither flag is present, default to background. The companion returns a job id immediately; poll progress with `/codex:events <job-id> --since <last-seq> --json` (or `/codex:status <job-id>`) instead of blocking on a synchronous Bash call. This prevents the main Claude loop from deadlocking when Codex stalls or fails silently.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use the sequence polling flag in rescue guidance

This default-background guidance tells the main loop to poll with --since <last-seq>, but the events command treats --since as an ISO timestamp and uses --after-seq for sequence numbers. If a model follows this instruction with a numeric last sequence, ISO timestamps compare greater than that string and the poll returns old events repeatedly, defeating incremental monitoring; the example should use --after-seq <last-seq>.

Useful? React with 👍 / 👎.

@suminerProxy
Copy link
Copy Markdown
Author

Closing — switching to a fork-internal PR (suminerProxy/codex-plugin-cc) so the owner can self-merge. Re-opening if upstream eventually wants this contribution.

@Pelton
Copy link
Copy Markdown

Pelton commented May 22, 2026

Was the watchdog / compact-recovery / event-stream work here folded into another effort — PR #343 looks adjacent on broker cleanup, or is the watchdog path being addressed elsewhere? Tracking the broker-leak issue and want to make sure I'm following the active thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants