feat: per-job event stream + watchdog + compact recovery (1.1.0)#2

Merged: suminerProxy merged 7 commits into main from feat/event-stream-foundation on May 19, 2026

@suminerProxy suminerProxy commented May 19, 2026

Why

When Claude Code's main loop delegates a coding task to Codex via `/codex:rescue`, the subagent makes a single synchronous `Bash` call into the companion `task` command and waits for it to return. If Codex stalls (prompt too long, model hang, tool deadlock), the main loop stalls too — there is no out-of-band signal it can poll to detect the freeze.

This PR adds the missing channel: a per-job NDJSON event stream the main Claude loop can poll with `/codex:events`, plus a stall watchdog that emits `phase:"stuck"` when codex goes quiet, and a protocol-native `/codex:compact` recovery path for context overflow. `/codex:rescue` now defaults to `--background` so the main loop never blocks on a synchronous Bash call.

What ships in 1.1.0

  • `/codex:events <job-id>`: incremental polling of normalized codex notifications (`--since` / `--after-seq` / `--limit` / `--json`).
  • `/codex:compact <thread-id>`: wraps `thread/compact/start` for the cancel → compact → resume-with-amended-prompt flow.
  • `/codex:rescue` default flipped to `--background`.
  • 60s per-job stall watchdog (override via `CODEX_COMPANION_STALL_SECONDS`) emits `{type:"watchdog", phase:"stuck"}`. Does not cancel.
  • New `{type:"job/exited"}` terminal event with `phase: completed|failed`. Fixes a bug where codex turn failures were misreported as `phase:"completed"`.
  • Top-level `usage` on `runAppServerTurn`; real-time updates via `thread/tokenUsage/updated` events (`phase:"metering"`).
  • codex CLI 0.131 notification coverage extended (`thread/status/changed`, `warning`, `thread/tokenUsage/updated`, `userMessage`, `assistantMessage`/`agentMessage`, `reasoning`). E2E shows `unknown` phase count goes from 5/9 (56%) to 0/13.
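A minimal sketch of how a polling consumer might fold one batch of events into a cursor and a coarse job state. It assumes each event carries `seq`, `type`, and `phase` as described above; `summarizeEvents`, its return shape, and the rule that any later event clears a `stuck` state are illustrative assumptions, not part of the plugin.

```javascript
// Hypothetical helper: reduce a batch of polled events to
// { lastSeq, state, exit }. Only the field names seq/type/phase and the
// "watchdog" / "job/exited" event types come from the PR text.
function summarizeEvents(events, previousSeq = 0) {
  let lastSeq = previousSeq;
  let state = "running";
  let exit = null;

  for (const event of events) {
    if (typeof event.seq === "number" && event.seq > lastSeq) {
      lastSeq = event.seq;
    }
    if (event.type === "job/exited") {
      // Terminal record: phase is "completed" or "failed".
      state = event.phase;
      exit = event;
      break;
    }
    if (event.type === "watchdog" && event.phase === "stuck") {
      state = "stuck";
    } else if (state === "stuck") {
      // Assumption: any event after the watchdog record means codex is
      // producing output again.
      state = "running";
    }
  }

  return { lastSeq, state, exit };
}
```

The returned `lastSeq` would be passed back as `--after-seq` on the next poll so that each call only reads new events.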

Includes #1 (test isolation) as a prerequisite

This branch contains 7 commits, the first of which (`5e8100a`) is the same fix as #1. Merging this PR alone brings both fixes in; merging #1 first and then this PR is also fine (git will skip the already-applied commit).

Verification

  • ✅ `npm test`: 128 / 128 pass (baseline 86 + 42 new tests).
  • ✅ Success path E2E against real codex 0.131.0-alpha.9: 13 events, `job/exited phase:"completed" exitCode:0`, 0 unknown-phase events.
  • ✅ Failed path E2E: `--effort minimal` → `invalid_request_error` → `job/exited phase:"failed" exitCode:1 errorMessage:"Task did not complete successfully (...)"`. Stored `state.json` matches.
  • ✅ Stall watchdog: `CODEX_COMPANION_STALL_SECONDS=5` + xhigh reasoning → `{type:"watchdog", phase:"stuck", stallMs:8540}` after a Reasoning item went silent for 8.54s. It did not cancel the job and did not re-fire spuriously during subsequent ~2s quiet periods.
  • ✅ Compact rejection path: `/codex:compact` with a bogus thread id → `attempted:true, compacted:false, detail:"invalid thread id: ..."`.

Open schema unknowns (do not block merge)

  • `thread/compact/start` success payload shape is undocumented for codex CLI 0.131. Wrapper preserves it verbatim under `.result`.
  • `thread/tokenUsage/updated` payload field names not confirmed. Normalize tries several aliases; falls back to a stable label. The event still fires correctly (`phase:"metering"`).

Build status

Pre-existing upstream incompatibility with codex CLI 0.131 (`requestAttestation` / `experimentalRawEvents` schema drift) makes `npm run build` fail on `main` too. Not introduced by this PR; intentionally not addressed to keep the diff focused.

(Previously opened as openai#335 — closed in favor of this fork-internal PR so the repo owner can self-merge.)

Summary by CodeRabbit

Release Notes v1.1.0

  • New Features

    • Added /codex:events command for streaming per-job notifications and progress monitoring
    • Added /codex:compact command for context overflow recovery
    • Token usage now surfaced as a top-level field with real-time streaming updates
  • Bug Fixes

    • /codex:rescue now defaults to background execution to prevent deadlocks
    • Added stall watchdog monitoring for job health
  • Documentation

    • Updated changelog with new command and feature details
    • Added guidance for background job monitoring via event polling

bit-star added 7 commits May 19, 2026 15:59
Host plugin runtimes (e.g. Claude Code) inject CLAUDE_PLUGIN_DATA and
CODEX_COMPANION_SESSION_ID into the plugin process. When these leak into
`npm test`, two tests fail with non-bug assertions:

- tests/state.test.mjs: "resolveStateDir uses a temp-backed per-workspace
  directory" — CLAUDE_PLUGIN_DATA redirects state to ~/.claude/plugins/data
  instead of os.tmpdir().
- tests/runtime.test.mjs: "result without a job id ..." — fixture writer
  and the spawned `result` subprocess look at different state dirs.

CI passes (no host-runtime env), so this only bites contributors running
`npm test` from inside a plugin host. Fix: delete these env vars at the top
of tests/helpers.mjs (the shared import) so every test gets a clean slate.
Tests that exercise the env var explicitly ("resolveStateDir uses
CLAUDE_PLUGIN_DATA when it is provided") already use try/finally to set and
restore it, so they continue to pass.

Verified: 86/86 tests pass with the polluting env vars still set in the
parent shell.

Add resolveJobEventsFile / appendJobEvent / readJobEvents for the upcoming
Claude-main-loop observability work (see DESIGN doc at user planning area).

Contract:
- Per-job events file at {stateDir}/jobs/{jobId}.events.ndjson.
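- appendJobEvent caps each line at 4KB and appends it with a single
  `write(fd, buf, n)` on an O_APPEND descriptor. POSIX makes the
  seek-to-end and the write one step under O_APPEND, so appends never
  overwrite each other. The PIPE_BUF atomicity guarantee (4096 on Linux,
  512 on macOS) covers pipes and FIFOs, not regular files, so readers
  must still tolerate a partial last line. Oversized lines first
  truncate the `raw` field; if still too big, the whole event is replaced
  with a minimal {type: "oversize-event-elided", seq, ts, method, phase}.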
- readJobEvents returns [] for missing files. Partial last line (writer
  mid-write before the trailing \n) is tolerated — JSON.parse failure on
  the trailing line is skipped, next read picks it up. afterSeq takes
  precedence over since when both are passed.

This commit is consumer-less; the producer side (captureTurn onNotification
hook) lands in the next commit so this slice can ship green independently.

Tests: 10 new unit tests covering path resolution, append+read, missing
file, afterSeq filter, since filter, afterSeq precedence over since, limit,
partial-line tolerance, raw truncation, and oversize elision. 96/96 pass.
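The contract above can be sketched on plain strings, without touching the filesystem. This is an illustration under stated assumptions, not the code in state.mjs: `encodeEventLine` and `parseEventLines` are hypothetical names, the 512-character `raw` cut is arbitrary, `ts` is treated as a number, `since` is inclusive, and `limit` keeps the first N matches.

```javascript
const MAX_LINE_BYTES = 4096; // 4KB cap from the contract above

// Serialize one event as a single NDJSON line that fits the cap.
function encodeEventLine(event) {
  let line = JSON.stringify(event) + "\n";
  if (Buffer.byteLength(line) <= MAX_LINE_BYTES) return line;

  // Step 1: truncate `raw` (arbitrary 512-char cut) and retry.
  if (event.raw !== undefined) {
    const rawText =
      typeof event.raw === "string" ? event.raw : JSON.stringify(event.raw);
    const trimmed = { ...event, raw: rawText.slice(0, 512) + "...[truncated]" };
    line = JSON.stringify(trimmed) + "\n";
    if (Buffer.byteLength(line) <= MAX_LINE_BYTES) return line;
  }

  // Step 2: still too big, so replace the event with a minimal marker.
  const { seq, ts, method, phase } = event;
  return (
    JSON.stringify({ type: "oversize-event-elided", seq, ts, method, phase }) +
    "\n"
  );
}

// Parse NDJSON text, skipping lines that do not parse (which covers a
// partial trailing line from an in-flight write).
function parseEventLines(text, { afterSeq, since, limit } = {}) {
  const events = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    let event;
    try {
      event = JSON.parse(line);
    } catch {
      continue;
    }
    if (event === null || typeof event !== "object") continue;
    if (afterSeq !== undefined) {
      // afterSeq takes precedence: `since` is ignored when both are set.
      if (!(event.seq > afterSeq)) continue;
    } else if (since !== undefined) {
      if (!(event.ts >= since)) continue;
    }
    events.push(event);
  }
  return limit !== undefined ? events.slice(0, limit) : events;
}
```

Keeping the encoder and parser free of I/O makes the size cap and the filter precedence easy to unit test; the real functions would wrap the same logic around an append-mode write and a file read.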

Three changes to codex.mjs, building on the per-job NDJSON event API:

1. New exported `normalizeNotification(state, message)` function — pure
   transform from app-server notifications to the flat event shape that
   gets appended to {jobId}.events.ndjson. Covers thread/started,
   thread/name/updated, turn/started, item/started, item/completed,
   error, turn/completed, and an "unknown" fallback for forward-compat
   methods (e.g. thread/compact/started). Phase inference reuses the
   existing describeStartedItem / describeCompletedItem maps so the
   vocabulary stays consistent with on-screen progress text.

2. `captureTurn` accepts an `options.onNotification` callback. A small
   `dispatch(message)` wrapper applies state mutation first, then emits
   the normalized event. Order is load-bearing: normalize reads
   state.threadTurnIds populated by applyTurnNotification. onNotification
   errors are swallowed so a broken consumer can never crash the worker.

3. `runAppServerTurn` forwards `options.onNotification` to captureTurn
   and surfaces `usage` (from turnState.finalTurn?.usage) as a top-level
   field on the result, so /codex:status and the events stream can read
   token usage without traversing the nested `turn` payload.

Tests: 13 new unit tests covering each method branch, phase inference
edge cases (commandExecution -> running vs verifying), unknown-method
fallback, and malformed-input resilience. 109/109 pass.

No behavioral change for callers that don't pass onNotification — the
hook is fully opt-in. The companion-side wiring lands separately.
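The ordering constraint in item 2 can be sketched as a small wrapper. `createDispatch`, `applyState`, and `normalize` are hypothetical stand-ins for the real functions in codex.mjs; in this sketch, errors thrown by either the normalizer or the consumer are dropped, which is one reading of "errors are swallowed".

```javascript
// Hypothetical wrapper: mutate state first, then emit the normalized
// event, and never let a consumer failure escape.
function createDispatch({ state, applyState, normalize, onNotification }) {
  return function dispatch(message) {
    // 1. State mutation first, so normalize() sees up-to-date ids.
    applyState(state, message);

    // The hook is opt-in: without a callback there is nothing to emit.
    if (typeof onNotification !== "function") return;

    try {
      // 2. Then emit the normalized event.
      onNotification(normalize(state, message));
    } catch {
      // A broken consumer must not crash the worker.
    }
  };
}
```

Swapping steps 1 and 2 would make the first event of a turn normalize against stale state, which is why the commit calls the order load-bearing.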

Wires up, end to end, the observability pieces that landed in the prior
two commits (state.mjs events API + codex.mjs notification hook).

In codex-companion.mjs:

- executeTaskRun: forward `onNotification` through to runAppServerTurn.
  Sync foreground callers still don't pass one; only the background
  task-worker path opts in.

- handleTaskWorker: build a same-process closure that
  (a) appends every normalized notification to {jobId}.events.ndjson
      with a monotonic per-job seq number;
  (b) reflects phase transitions into state.json via upsertJob (only on
      real phase change, to keep the unlocked single-flight writer rare);
  (c) runs a stall watchdog that checks every 5s: when the gap since
      the last event exceeds CODEX_COMPANION_STALL_SECONDS (default
      60s), it emits one {type:"watchdog", phase:"stuck"} record and
      flips job phase to stuck. It does NOT cancel. Main-loop Claude
      decides what to do next (continue / compact / cancel).
  (d) on terminal exit (success OR failure), emits a single
      {type:"job/exited", phase:"completed"|"failed", exitCode, errorMessage}
      record so a polling reader can distinguish "still slow" from
      "already finished". The job.status state.json field is no longer
      the sole source of truth — main-loop Claude must look for the
      job/exited event.
  Observability errors are swallowed; they must never crash the worker.
  Watchdog interval is unref'd so it can't keep the event loop alive.

- New `events <job-id> [--since|--after-seq|--limit] [--json]` subcommand
  delegating to readJobEvents. Default output is human-readable lines;
  --json returns {jobId, eventsFile, count, events}. After-seq + since
  semantics match readJobEvents (after-seq takes precedence). printUsage
  updated.

Tests: 7 integration tests spawning the real CLI to verify the events
command's empty-result, append-and-read, after-seq filtering, limit
capping, human-readable rendering, and missing-jobId usage paths.
116/116 pass.

The stall threshold is configurable via env var for now
(CODEX_COMPANION_STALL_SECONDS). A `task --max-stall-seconds <N>` CLI
flag is deferred to a later commit so handleTask's option parser stays
untouched in this slice.
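A sketch of the watchdog behavior described in (c), with the clock injected so it can be tested without real timers. The names `createStallWatchdog`, `touch`, `tick`, and `startStallWatchdog` are hypothetical, and re-arming on the next real event is an assumption about how "emits one record" plays out across several quiet periods.

```javascript
// Hypothetical watchdog core: emits at most one "stuck" record per
// quiet period and never cancels anything.
function createStallWatchdog({ thresholdMs, emit, now = Date.now }) {
  let lastEventAt = now();
  let fired = false;

  return {
    // Call on every real event; resets the gap and re-arms the watchdog.
    touch() {
      lastEventAt = now();
      fired = false;
    },
    // Call on each timer tick.
    tick() {
      const stallMs = now() - lastEventAt;
      if (!fired && stallMs > thresholdMs) {
        fired = true;
        emit({ type: "watchdog", phase: "stuck", stallMs });
      }
    },
  };
}

// Wiring sketch: 5s tick, threshold from the env var (default 60s).
function startStallWatchdog(emit, env = process.env) {
  const seconds = Number(env.CODEX_COMPANION_STALL_SECONDS) || 60;
  const watchdog = createStallWatchdog({ thresholdMs: seconds * 1000, emit });
  const timer = setInterval(() => watchdog.tick(), 5000);
  timer.unref(); // an unref'd timer does not keep the event loop alive
  return { watchdog, stop: () => clearInterval(timer) };
}
```

Separating the decision (`tick`) from the timer keeps the 5s interval out of unit tests; only `startStallWatchdog` touches `setInterval`.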

E2E against real codex CLI 0.131.0-alpha.9 uncovered three issues mock
tests could not surface. Fixed all three; rerun confirms zero `unknown`
phase events in a successful turn.

1. handleTaskWorker job/exited bug (correctness)
   runTrackedJob does NOT throw when codex returns a failed turn — it
   resolves with execution.exitStatus != 0 and writes state.status="failed"
   out-of-band. Previous code keyed `completed` off the absence of an
   exception, so a failed turn was misreported as `phase:"completed",
   exitCode:0` in the terminal job/exited event. This is the single
   record main-loop Claude reads to decide "did this work?" — getting
   it wrong led to silent false positives. Fix: inspect
   execution.exitStatus directly. Bug repro: dispatch with --effort
   minimal (which conflicts with web_search) — codex returns an
   invalid_request_error, runTrackedJob resolves with exitStatus=1,
   stored state.json shows status="failed", but the prior code emitted
   {phase:"completed", exitCode:0}. After fix: {phase:"failed",
   exitCode:1, errorMessage:"Task did not complete successfully ..."}.

2. normalize coverage gaps (forward-compat / readability)
   codex 0.131 emits three notification methods + three item types the
   prior switch did not recognize, leaving them as phase:"unknown" with
   generic fallback messages:

   - method `thread/status/changed` with status.type in {"idle"} →
     phase:"idle", message "Thread idle.". Post-turn quiescence signal.
   - method `thread/tokenUsage/updated` → phase:"metering" (new word in
     the phase vocab). Streaming token-usage source. This is the real
     event main-loop Claude should poll to detect context-budget pressure
     before turn/completed surfaces final usage. NOTE: codex 0.131 payload
     shape for the usage object is not documented; normalize tries
     {inputTokens,outputTokens,cachedInputTokens} and several aliases;
     when no recognized keys are found, falls back to a stable label and
     preserves raw. Concrete schema discovery is a Phase 3 follow-up.
   - item.type `agentMessage` / `assistantMessage` / `reasoning` →
     phase:"thinking". `agentMessage` is codex's final-reply item;
     describeStartedItem/describeCompletedItem now use a new
     extractItemText helper to surface a content preview ("Codex
     replied: pong") so main-loop Claude can recognize the answer
     from the event stream without fetching /codex:result.

3. Single E2E rerun verification
   Same prompt with --effort low (avoids the minimal+web_search
   constraint): 13 events emitted, phase distribution {thinking:8,
   completed:2, idle:1, metering:1, warning:1, unknown:0}. The prior
   normalize generated 5 unknown out of 9 events (56%); this commit
   brings it to 0/13 on the happy path.
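A sketch of the alias-with-fallback idea from the `thread/tokenUsage/updated` bullet. Only the camelCase field names come from the commit text; the snake_case aliases, the message strings, and the function name `describeTokenUsage` are guesses for illustration, since the real payload shape is undocumented.

```javascript
// camelCase names are from the commit message; the snake_case aliases
// are assumptions added for this example.
const USAGE_ALIASES = {
  inputTokens: ["inputTokens", "input_tokens"],
  outputTokens: ["outputTokens", "output_tokens"],
  cachedInputTokens: ["cachedInputTokens", "cached_input_tokens"],
};

function describeTokenUsage(rawUsage) {
  const usage = {};
  for (const [field, aliases] of Object.entries(USAGE_ALIASES)) {
    // Take the first alias that holds a number.
    const key = aliases.find((name) => typeof rawUsage?.[name] === "number");
    if (key !== undefined) usage[field] = rawUsage[key];
  }

  const recognized = Object.keys(usage).length > 0;
  return {
    phase: "metering",
    // Stable fallback label when no known key is present.
    message: recognized
      ? `Token usage: in=${usage.inputTokens ?? "?"} out=${usage.outputTokens ?? "?"}`
      : "Token usage updated.",
    usage: recognized ? usage : null,
    raw: rawUsage, // always preserved for later schema discovery
  };
}
```

Because `raw` is kept in both branches, an unrecognized payload still reaches the event file and can be inspected once the schema is known.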

Tests: +6 new normalize unit tests (status idle, tokenUsage with usage,
tokenUsage without usage, agentMessage with text, agentMessage with
content[].text, schema fallback). 128/128 pass total. The job/exited
fix is exercised in the existing fake-codex fixture indirectly; a
direct unit test would need a fake runTrackedJob that simulates
"resolved with non-zero exitStatus" which is non-trivial to mock — for
now the E2E repro and the inline reasoning in the comment are the
documented evidence.
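One way to make the job/exited fix directly testable would be to move the decision into a pure function. The sketch below assumes the resolved value exposes `exitStatus` as the commit describes; `buildExitEvent` and its argument shape are hypothetical, not the code in codex-companion.mjs.

```javascript
// Hypothetical pure helper: derive the terminal event from the exit
// status rather than from the absence of an exception.
function buildExitEvent({ execution, error } = {}) {
  // A thrown error counts as a failure; otherwise trust exitStatus.
  const exitCode = error ? 1 : Number(execution?.exitStatus ?? 0);
  const failed = exitCode !== 0;

  const event = {
    type: "job/exited",
    phase: failed ? "failed" : "completed",
    exitCode,
  };
  if (failed) {
    event.errorMessage =
      error?.message ?? "Task did not complete successfully";
  }
  return event;
}
```

With this shape, the "resolved with non-zero exitStatus" case becomes a single call with `{ execution: { exitStatus: 1 } }` and needs no fake runTrackedJob.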

Closes the "operate codex like a Claude subagent" half of the design
contract. Phase 1+2 gave the main loop an event stream; Phase 3 gives
it the slash-command surface and the protocol-native recovery path.

In lib/codex.mjs:

- New exported compactAppServerThread(cwd, {threadId}). Wraps codex
  app-server's thread/compact/start RPC. This is the protocol-native
  recovery for "prompt too long" — main-loop Claude calls it after a
  turn fails with context-overflow, then resumes via /codex:rescue
  --resume <amended prompt>. Fire-and-return: the call awaits the
  app-server's ack but does not consume the streaming response. The
  broker recognizes thread/compact/start as STREAMING_METHOD but
  routes the stream to whoever owns it at that moment; compaction
  itself completes on the codex side regardless of consumer presence.
  Uses reuseExistingBroker: true so it can punch through to an
  already-running broker if one exists. The exact success payload
  shape is undocumented in codex CLI 0.131; result is preserved
  verbatim under .result for forward-compat.

In codex-companion.mjs:

- New `compact <thread-id> [--json]` subcommand wrapping
  compactAppServerThread. Plain output prints the operation result and
  hints the resume flow; --json returns the full structured report.
  Smoke-tested against real codex 0.131: bogus thread id correctly
  returns attempted:true, compacted:false, with codex's own error
  ("invalid thread id") preserved under .detail.

In agents/codex-rescue.md + commands/rescue.md:

- Default execution mode flipped from foreground to background. The
  prior heuristic ("small bounded => foreground; complex => background")
  was the deadlock root cause: a small task that stalls is still a
  deadlock, and "small" is unknowable in advance. Background is the
  safe default — the main loop polls /codex:events <job-id> for
  progress instead of blocking on the synchronous Bash call. --wait is
  honored when the user explicitly asks for foreground.

In commands/:

- New events.md: slash command surfacing the per-job event stream.
  Documents the {type:"job/exited"} terminal-state contract, the
  phase:"stuck" watchdog signal, the phase:"metering" token-usage
  source, and the --after-seq incremental polling pattern.
- New compact.md: slash command for the recovery sequence. Documents
  the typical "cancel → compact → resume with amended prompt" idiom
  main-loop Claude should follow when codex hits context overflow.

In tests/commands.test.mjs:

- Updated assertions to reflect the new background default and to
  include events.md + compact.md in the commands/ file-list invariant.
  The prior assertions hard-coded the prose "default to foreground" and
  the file list of 7 commands; both needed to track this change.

Tests: 128/128 pass. No new tests added — the compact path is exercised
indirectly via a smoke check (bogus thread id returns the expected
structured error against the real app-server). A full E2E for compact
needs a real turn first to obtain a valid thread id; that lands in
Phase 4 alongside version bump + CHANGELOG.
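A sketch of the report shape the compact wrapper returns, based on the `attempted` / `compacted` / `detail` / `.result` fields named above. `sendRequest` is a hypothetical stand-in for the app-server client, and passing `{ threadId }` as the params object is an assumption; only the method name `thread/compact/start` comes from the commit text.

```javascript
// Hypothetical wrapper around the compact RPC. It always resolves with
// a structured report instead of throwing.
async function compactThread(sendRequest, threadId) {
  try {
    const result = await sendRequest("thread/compact/start", { threadId });
    // The success payload shape is undocumented, so keep it verbatim.
    return { attempted: true, compacted: true, threadId, result };
  } catch (error) {
    return {
      attempted: true,
      compacted: false,
      threadId,
      // Preserve the server's own error text for the caller.
      detail: error instanceof Error ? error.message : String(error),
    };
  }
}
```

Returning a report in both branches lets a caller decide between resuming and giving up without wrapping every call in its own try/catch.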

Cuts the minor-version release for the observability rework
(feat/event-stream-foundation). Adds /codex:events, /codex:compact,
rescue defaults to --background, stall watchdog, terminal job/exited
events, top-level token usage, and broader codex CLI 0.131 notification
coverage. No breaking changes — all existing commands keep the same
signatures and outputs; the new event stream is additive.

coderabbitai Bot commented May 19, 2026

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between 807e03a and f9f36fb.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (16)
  • .claude-plugin/marketplace.json
  • package.json
  • plugins/codex/.claude-plugin/plugin.json
  • plugins/codex/CHANGELOG.md
  • plugins/codex/agents/codex-rescue.md
  • plugins/codex/commands/compact.md
  • plugins/codex/commands/events.md
  • plugins/codex/commands/rescue.md
  • plugins/codex/scripts/codex-companion.mjs
  • plugins/codex/scripts/lib/codex.mjs
  • plugins/codex/scripts/lib/state.mjs
  • tests/commands.test.mjs
  • tests/events-command.test.mjs
  • tests/helpers.mjs
  • tests/normalize-notification.test.mjs
  • tests/state.test.mjs

Walkthrough

This PR implements Codex plugin v1.1.0, introducing event-driven job observability via NDJSON streaming, context compaction for "prompt too long" recovery, and changes rescue defaults to non-blocking background execution with polling support. The release bumps manifests, adds event storage and normalization infrastructure, wires notifications through task execution, and exposes new CLI commands for monitoring and recovery.

Changes

Codex Plugin v1.1.0: Event Streaming, Context Compaction, and Job Lifecycle Observability

| Layer | File(s) | Summary |
|---|---|---|
| Release versioning and changelog | `.claude-plugin/marketplace.json`, `package.json`, `plugins/codex/.claude-plugin/plugin.json`, `plugins/codex/CHANGELOG.md` | Version bumped from 1.0.4 to 1.1.0 across all manifests. Changelog documents new observability features, background defaults, token usage export, and test isolation improvements. |
| Job event storage and NDJSON persistence | `plugins/codex/scripts/lib/state.mjs`, `tests/state.test.mjs` | Implements per-job NDJSON event log with atomic single-line writes capped at 4KB, automatic truncation of oversized payloads, filtering by sequence number or timestamp, and full test coverage for append/read/filter operations. |
| Notification normalization contract and item descriptors | `plugins/codex/scripts/lib/codex.mjs`, `tests/normalize-notification.test.mjs` | Exports normalizeNotification() to convert app-server notifications into stable event records. Updates item descriptors to recognize userMessage, assistantMessage, agentMessage, and reasoning types with rich progress messages. Includes 332 lines of tests covering all notification shapes and edge cases. |
| Notification dispatch in turn capture and token usage export | `plugins/codex/scripts/lib/codex.mjs` | Extends captureTurn() to emit normalized notifications via optional onNotification callback before returning. Wires callback through runAppServerTurn(), adds top-level usage field to turn result, and exports compactAppServerThread() for context recovery requests to app-server. |
| Worker task runtime with event streaming, stall detection, and terminal events | `plugins/codex/scripts/codex-companion.mjs` | Passes onNotification into task execution, maintains per-job event stream with sequence counting and lastEventAt timestamp tracking. Stall watchdog emits {type:"watchdog", phase:"stuck"} after 60s inactivity; terminal {type:"job/exited"} event based on exit status (not exceptions) signals completion to listeners. |
| Events and compact command handlers | `plugins/codex/scripts/codex-companion.mjs` | Implements handleEvents() supporting --since, --after-seq, and --limit filtering with JSON or formatted line output. Implements handleCompact() for context recovery with resume instructions. Both routed via CLI dispatcher. |
| Command documentation and rescue default behavior | `plugins/codex/agents/codex-rescue.md`, `plugins/codex/commands/compact.md`, `plugins/codex/commands/events.md`, `plugins/codex/commands/rescue.md` | Documents /codex:events with NDJSON field semantics and phase meanings. Documents /codex:compact with recovery sequence and output formats. Updates rescue and codex-rescue to default to --background execution with instructions to poll via events endpoint. |
| Test isolation, command tests, and events command coverage | `tests/helpers.mjs`, `tests/commands.test.mjs`, `tests/events-command.test.mjs` | Removes host-injected CLAUDE_PLUGIN_DATA and CODEX_COMPANION_SESSION_ID from test environment. Updates command tests to expect new compact.md and events.md and background defaults. Adds 121-line events command test suite covering missing job-id, unknown jobs, incremental filtering, output formats, and result limits. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 A rabbit hops through event logs,
Watching jobs stream NDJSON fog,
When prompts grow long and Codex stalls,
We compact threads and poll through calls—
Observability for all! 🎉

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


@suminerProxy suminerProxy merged commit c0a6926 into main May 19, 2026
1 check was pending
@suminerProxy suminerProxy deleted the feat/event-stream-foundation branch May 19, 2026 10:02
suminerProxy pushed a commit that referenced this pull request May 19, 2026
The 1.1.0 design doc's "testable verification #2" calls for an integration test that
dispatches a slow codex task and asserts the events stream surfaces
`phase:"stuck"`. The behavior was previously verified only by manual
E2E (stallMs=8540 ms in a real codex run); this commit adds CI-gated
coverage against regressions.

Approach: spawn task-worker against a fake codex that responds to
thread/start + turn/start then deliberately stops sending notifications.
With CODEX_COMPANION_STALL_SECONDS=1 the worker should emit a watchdog
event within ~5-7 seconds (setInterval ticks every 5s). The test polls
events.ndjson via readJobEvents (not the CLI, to bypass any rendering
layer), and asserts the stuck record's stallMs meets the configured
threshold. The job is explicitly cancelled at teardown and the broker
is torn down via SessionEnd.
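The ~5-7 second expectation follows from how a periodic check interacts with the threshold. A small sketch of that arithmetic, assuming the watchdog fires on the first tick at which the quiet gap exceeds the threshold; `stallDetectionDelay` and `phaseMs` are illustrative names, not part of the plugin.

```javascript
// Delay (ms after the last event) until a periodic watchdog first
// reports a stall. `phaseMs` is how long after the most recent tick the
// last event arrived (0 <= phaseMs < intervalMs).
function stallDetectionDelay(thresholdMs, intervalMs, phaseMs = 0) {
  // First tick after the last event.
  let elapsed = intervalMs - phaseMs;
  // Skip ticks where the gap has not yet exceeded the threshold.
  while (elapsed <= thresholdMs) elapsed += intervalMs;
  return elapsed;
}
```

With a 1s threshold and a 5s tick, a last event that lands right on a tick is reported 5s later; process spawn and handshake time would plausibly account for the rest of the window. In general the delay falls between the threshold and the threshold plus one interval.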

Extends tests/fake-codex-fixture.mjs with a hang-after-turn-start
behavior so future watchdog / timeout tests can reuse the same hang
pattern.