feat: per-job event stream + watchdog + compact recovery (1.1.0)#2
Conversation
Host plugin runtimes (e.g. Claude Code) inject CLAUDE_PLUGIN_DATA and
CODEX_COMPANION_SESSION_ID into the plugin process. When these leak into
`npm test`, two tests fail with non-bug assertions:
- tests/state.test.mjs: "resolveStateDir uses a temp-backed per-workspace
directory" — CLAUDE_PLUGIN_DATA redirects state to ~/.claude/plugins/data
instead of os.tmpdir().
- tests/runtime.test.mjs: "result without a job id ..." — fixture writer
and the spawned `result` subprocess look at different state dirs.
CI passes (no host-runtime env), so this only bites contributors running
`npm test` from inside a plugin host. Fix: delete these env vars at the top
of tests/helpers.mjs (the shared import) so every test gets a clean slate.
Tests that exercise the env var explicitly ("resolveStateDir uses
CLAUDE_PLUGIN_DATA when it is provided") already use try/finally to set and
restore it, so they continue to pass.
Verified: 86/86 tests pass with the polluting env vars still set in the
parent shell.
Add resolveJobEventsFile / appendJobEvent / readJobEvents for the upcoming
Claude-main-loop observability work (see DESIGN doc at user planning area).
Contract:
- Per-job events file at {stateDir}/jobs/{jobId}.events.ndjson.
- appendJobEvent caps each line at 4KB. POSIX `write(fd, buf, n)` with
O_APPEND is atomic when n < PIPE_BUF (~4096 on Linux/macOS), so single-
write append is safe across concurrent readers. Oversized lines first
truncate the `raw` field; if still too big, the whole event is replaced
with a minimal {type: "oversize-event-elided", seq, ts, method, phase}.
- readJobEvents returns [] for missing files. Partial last line (writer
mid-write before the trailing \n) is tolerated — JSON.parse failure on
the trailing line is skipped, next read picks it up. afterSeq takes
precedence over since when both are passed.
This commit is consumer-less; the producer side (captureTurn onNotification
hook) lands in the next commit so this slice can ship green independently.
Tests: 10 new unit tests covering path resolution, append+read, missing
file, afterSeq filter, since filter, afterSeq precedence over since, limit,
partial-line tolerance, raw truncation, and oversize elision. 96/96 pass.
Three changes to codex.mjs, building on the per-job NDJSON event API:
1. New exported `normalizeNotification(state, message)` function — pure
transform from app-server notifications to the flat event shape that
gets appended to {jobId}.events.ndjson. Covers thread/started,
thread/name/updated, turn/started, item/started, item/completed,
error, turn/completed, and an "unknown" fallback for forward-compat
methods (e.g. thread/compact/started). Phase inference reuses the
existing describeStartedItem / describeCompletedItem maps so the
vocabulary stays consistent with on-screen progress text.
2. `captureTurn` accepts an `options.onNotification` callback. A small
`dispatch(message)` wrapper applies state mutation first, then emits
the normalized event. Order is load-bearing: normalize reads
state.threadTurnIds populated by applyTurnNotification. onNotification
errors are swallowed so a broken consumer can never crash the worker.
3. `runAppServerTurn` forwards `options.onNotification` to captureTurn
and surfaces `usage` (from turnState.finalTurn?.usage) as a top-level
field on the result, so /codex:status and the events stream can read
token usage without traversing the nested `turn` payload.
Tests: 13 new unit tests covering each method branch, phase inference
edge cases (commandExecution -> running vs verifying), unknown-method
fallback, and malformed-input resilience. 109/109 pass.
No behavioral change for callers that don't pass onNotification — the
hook is fully opt-in. The companion-side wiring lands separately.
End-to-end wiring of the observability path landed in the prior two commits
(state.mjs events API + codex.mjs notification hook).
In codex-companion.mjs:
- executeTaskRun: forward `onNotification` through to runAppServerTurn.
Sync foreground callers still don't pass one; only the background
task-worker path opts in.
- handleTaskWorker: build a same-process closure that
(a) appends every normalized notification to {jobId}.events.ndjson
with a monotonic per-job seq number;
(b) reflects phase transitions into state.json via upsertJob (only on
real phase change, to keep the unlocked single-flight writer rare);
(c) runs a 5s stall watchdog — when the gap since the last event
exceeds CODEX_COMPANION_STALL_SECONDS (default 60s), it emits one
{type:"watchdog", phase:"stuck"} record and flips job phase to
stuck. It does NOT cancel. Main-loop Claude decides what to do
next (continue / compact / cancel).
(d) on terminal exit (success OR failure), emits a single
{type:"job/exited", phase:"completed"|"failed", exitCode, errorMessage}
record so a polling reader can distinguish "still slow" from
"already finished". The job.status state.json field is no longer
the sole source of truth — main-loop Claude must look for the
job/exited event.
Observability errors are swallowed; they must never crash the worker.
Watchdog interval is unref'd so it can't keep the event loop alive.
- New `events <job-id> [--since|--after-seq|--limit] [--json]` subcommand
delegating to readJobEvents. Default output is human-readable lines;
--json returns {jobId, eventsFile, count, events}. After-seq + since
semantics match readJobEvents (after-seq takes precedence). printUsage
updated.
Tests: 7 integration tests spawning the real CLI to verify the events
command's empty-result, append-and-read, after-seq filtering, limit
capping, human-readable rendering, and missing-jobId usage paths.
116/116 pass.
The stall threshold is configurable via env var for now
(CODEX_COMPANION_STALL_SECONDS). A `task --max-stall-seconds <N>` CLI
flag is deferred to a later commit so handleTask's option parser stays
untouched in this slice.
E2E against real codex CLI 0.131.0-alpha.9 uncovered three issues mock
tests could not surface. Fixed all three; rerun confirms zero `unknown`
phase events in a successful turn.
1. handleTaskWorker job/exited bug (correctness)
runTrackedJob does NOT throw when codex returns a failed turn — it
resolves with execution.exitStatus != 0 and writes state.status="failed"
out-of-band. Previous code keyed `completed` off the absence of an
exception, so a failed turn was misreported as `phase:"completed",
exitCode:0` in the terminal job/exited event. This is the single
record main-loop Claude reads to decide "did this work?" — getting
it wrong led to silent false positives. Fix: inspect
execution.exitStatus directly. Bug repro: dispatch with --effort
minimal (which conflicts with web_search) — codex returns an
invalid_request_error, runTrackedJob resolves with exitStatus=1,
stored state.json shows status="failed", but the prior code emitted
{phase:"completed", exitCode:0}. After fix: {phase:"failed",
exitCode:1, errorMessage:"Task did not complete successfully ..."}.
2. normalize coverage gaps (forward-compat / readability)
codex 0.131 emits three notification methods + three item types the
prior switch did not recognize, leaving them as phase:"unknown" with
generic fallback messages:
- method `thread/status/changed` with status.type in {"idle"} →
phase:"idle", message "Thread idle.". Post-turn quiescence signal.
- method `thread/tokenUsage/updated` → phase:"metering" (new word in
the phase vocab). Streaming token-usage source. This is the real
event main-loop Claude should poll to detect context-budget pressure
before turn/completed surfaces final usage. NOTE: codex 0.131 payload
shape for the usage object is not documented; normalize tries
{inputTokens,outputTokens,cachedInputTokens} and several aliases;
when no recognized keys are found, falls back to a stable label and
preserves raw. Concrete schema discovery is a Phase 3 follow-up.
- item.type `agentMessage` / `assistantMessage` / `reasoning` →
phase:"thinking". `agentMessage` is codex's final-reply item;
describeStartedItem/describeCompletedItem now use a new
extractItemText helper to surface a content preview ("Codex
replied: pong") so main-loop Claude can recognize the answer
from the event stream without fetching /codex:result.
3. Single E2E rerun verification
Same prompt with --effort low (avoids the minimal+web_search
constraint): 13 events emitted, phase distribution {thinking:8,
completed:2, idle:1, metering:1, warning:1, unknown:0}. The prior
normalize generated 5 unknown out of 9 events (56%); this commit
brings it to 0/13 on the happy path.
Tests: +6 new normalize unit tests (status idle, tokenUsage with usage,
tokenUsage without usage, agentMessage with text, agentMessage with
content[].text, schema fallback). 128/128 pass total. The job/exited
fix is exercised in the existing fake-codex fixture indirectly; a
direct unit test would need a fake runTrackedJob that simulates
"resolved with non-zero exitStatus" which is non-trivial to mock — for
now the E2E repro and the inline reasoning in the comment are the
documented evidence.
Closes the "operate codex like a Claude subagent" half of the design
contract. Phase 1+2 gave the main loop an event stream; Phase 3 gives
it the slash-command surface and the protocol-native recovery path.
In lib/codex.mjs:
- New exported compactAppServerThread(cwd, {threadId}). Wraps codex
app-server's thread/compact/start RPC. This is the protocol-native
recovery for "prompt too long" — main-loop Claude calls it after a
turn fails with context-overflow, then resumes via /codex:rescue
--resume <amended prompt>. Fire-and-return: the call awaits the
app-server's ack but does not consume the streaming response. The
broker recognizes thread/compact/start as STREAMING_METHOD but
routes the stream to whoever owns it at that moment; compaction
itself completes on the codex side regardless of consumer presence.
Uses reuseExistingBroker: true so it can punch through to an
already-running broker if one exists. The exact success payload
shape is undocumented in codex CLI 0.131; result is preserved
verbatim under .result for forward-compat.
In codex-companion.mjs:
- New `compact <thread-id> [--json]` subcommand wrapping
compactAppServerThread. Plain output prints the operation result and
hints the resume flow; --json returns the full structured report.
Smoke-tested against real codex 0.131: bogus thread id correctly
returns attempted:true, compacted:false, with codex's own error
("invalid thread id") preserved under .detail.
In agents/codex-rescue.md + commands/rescue.md:
- Default execution mode flipped from foreground to background. The
prior heuristic ("small bounded => foreground; complex => background")
was the deadlock root cause: a small task that stalls is still a
deadlock, and "small" is unknowable in advance. Background is the
safe default — the main loop polls /codex:events <job-id> for
progress instead of blocking on the synchronous Bash call. --wait is
honored when the user explicitly asks for foreground.
In commands/:
- New events.md: slash command surfacing the per-job event stream.
Documents the {type:"job/exited"} terminal-state contract, the
phase:"stuck" watchdog signal, the phase:"metering" token-usage
source, and the --after-seq incremental polling pattern.
- New compact.md: slash command for the recovery sequence. Documents
the typical "cancel → compact → resume with amended prompt" idiom
main-loop Claude should follow when codex hits context overflow.
In tests/commands.test.mjs:
- Updated assertions to reflect the new background default and to
include events.md + compact.md in the commands/ file-list invariant.
The prior assertions hard-coded the prose "default to foreground" and
the file list of 7 commands; both needed to track this change.
Tests: 128/128 pass. No new tests added — the compact path is exercised
indirectly via a smoke check (bogus thread id returns the expected
structured error against the real app-server). A full E2E for compact
needs a real turn first to obtain a valid thread id; that lands in
Phase 4 alongside version bump + CHANGELOG.
Cuts the minor-version release for the observability rework (feat/event-stream-foundation). Adds /codex:events, /codex:compact, rescue defaults to --background, stall watchdog, terminal job/exited events, top-level token usage, and broader codex CLI 0.131 notification coverage. No breaking changes — all existing commands keep the same signatures and outputs; the new event stream is additive.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (16)
📝 WalkthroughWalkthroughThis PR implements Codex plugin v1.1.0, introducing event-driven job observability via NDJSON streaming, context compaction for "prompt too long" recovery, and changes rescue defaults to non-blocking background execution with polling support. The release bumps manifests, adds event storage and normalization infrastructure, wires notifications through task execution, and exposes new CLI commands for monitoring and recovery. ChangesCodex Plugin v1.1.0: Event Streaming, Context Compaction, and Job Lifecycle Observability
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
The 1.1.0 design doc's "可测验证 #2" calls for an integration test that dispatches a slow codex task and asserts the events stream surfaces `phase:"stuck"`. The behavior was previously verified only by manual E2E (stallMs=8540 ms in a real codex run); this commit adds CI-gated coverage against regressions. Approach: spawn task-worker against a fake codex that responds to thread/start + turn/start then deliberately stops sending notifications. With CODEX_COMPANION_STALL_SECONDS=1 the worker should emit a watchdog event within ~5-7 seconds (setInterval ticks every 5s). The test polls events.ndjson via readJobEvents (not the CLI, to bypass any rendering layer), and asserts the stuck record's stallMs meets the configured threshold. The job is explicitly cancelled at teardown and the broker is torn down via SessionEnd. Extends tests/fake-codex-fixture.mjs with a hang-after-turn-start behavior so future watchdog / timeout tests can reuse the same hang pattern.
Why
When Claude Code's main loop delegates a coding task to Codex via `/codex:rescue`, the subagent makes a single synchronous `Bash` call into the companion `task` command and waits for it to return. If Codex stalls (prompt too long, model hang, tool deadlock), the main loop stalls too — there is no out-of-band signal it can poll to detect the freeze.
This PR adds the missing channel: a per-job NDJSON event stream the main Claude loop can poll with `/codex:events`, plus a stall watchdog that emits `phase:"stuck"` when codex goes quiet, and a protocol-native `/codex:compact` recovery path for context overflow. `/codex:rescue` now defaults to `--background` so the main loop never blocks on a synchronous Bash call.
What ships in 1.1.0
Includes #1 (test isolation) as a prerequisite
This branch contains 8 commits, the first of which (`5e8100a`) is the same fix as #1. Merging this PR alone brings both fixes in; merging #1 first and then this PR is also fine (git will skip the already-applied commit).
Verification
Open schema unknowns (do not block merge)
Build status
Pre-existing upstream incompatibility with codex CLI 0.131 (`requestAttestation` / `experimentalRawEvents` schema drift) makes `npm run build` fail on `main` too. Not introduced by this PR; intentionally not addressed to keep the diff focused.
(Previously opened as openai#335 — closed in favor of this fork-internal PR so the repo owner can self-merge.)
Summary by CodeRabbit
Release Notes v1.1.0
New Features
/codex:eventscommand for streaming per-job notifications and progress monitoring/codex:compactcommand for context overflow recoveryBug Fixes
/codex:rescuenow defaults to background execution to prevent deadlocksDocumentation