fix(runtime): clean up stale brokers and lock state updates by 60ke · Pull Request #343 · openai/codex-plugin-cc

60ke · 2026-05-22T09:34:31Z

Problem

Stale broker sessions could remove their socket and metadata without terminating the recorded broker process when no test-only killProcess override was supplied. State updates also used an unlocked load/mutate/save sequence, so concurrent workers could overwrite each other and drop job records.

Changes

Default broker teardown to terminateProcessTree and pass that default through stale-session and failed-start cleanup.
Add a small cross-process state lock around the full state transaction so concurrent upsertJob calls preserve each other.
Add regression coverage for stale broker termination and concurrent job state writes.

Testing

node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs
node --test tests/broker-endpoint.test.mjs tests/bump-version.test.mjs tests/commands.test.mjs tests/git.test.mjs tests/process.test.mjs tests/render.test.mjs
git diff --cached --check

Note: local npm run build reaches tsc but fails on generated app-server type drift unrelated to this change (requestAttestation / experimentalRawEvents).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6142e6cd08

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

60ke · 2026-05-22T09:41:31Z

Good catch. The timeout needs to be longer than the stale-lock recovery window, otherwise a fresh orphaned lock can make every writer fail before recovery is allowed. I updated the PR so STATE_LOCK_TIMEOUT_MS is derived from STATE_LOCK_STALE_MS + 5000, then reran node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs and git diff --check.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc4de41deb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-22T09:46:20Z

+  killProcess = terminateProcessTree
+}) {
+  if (Number.isFinite(pid)) {
    try {
      killProcess(pid);


Validate stale-session PID before killing process tree

This change makes teardownBrokerSession kill any finite persisted pid by default, but ensureBrokerSession can reach this path using stale broker.json data when the endpoint is no longer reachable. In that scenario, PID reuse can cause terminateProcessTree to signal an unrelated process (or process group) that now owns the same PID, so simply restarting the companion after a crash can terminate the wrong workload. Please gate termination on broker identity (for example, command/start-time checks) before sending signals.

Useful? React with 👍 / 👎.

Good catch. I updated the broker teardown path so the default signal is now gated on broker identity before calling terminateProcessTree: the pid file must still point at the same pid and the live command line must look like app-server-broker.mjs serve for the same endpoint/pid file. If that check fails, teardown only removes the stale metadata/socket files and leaves the process alone.

I also added regression coverage for both sides: one test verifies a matching stale broker is terminated, and another verifies a reused stale pid running unrelated Node code is not killed. Reran node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs and git diff --check.

Pelton · 2026-05-22T22:53:53Z

Hitting the same broker-leak symptoms on macOS (Claude Code with codex@openai-codex v1.0.4): app-server-broker.mjs plus node codex app-server and the Rust codex app-server survive past SessionEnd, reparent to launchd (PPID 1), and accumulate across sessions.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: df73eb3129

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-23T04:38:31Z

+    if (Date.now() - stats.mtimeMs > STATE_LOCK_STALE_MS) {
+      fs.rmSync(lockDir, { recursive: true, force: true });


Preserve active lock owners during stale-lock recovery

removeStaleLock expires .state.lock purely by directory mtime, so any legitimate state transaction that runs longer than 30s can have its live lock deleted by a competing process. After that, both writers can execute updateState concurrently, and the first writer’s release() can also remove the second writer’s lock, reintroducing the lost-update corruption this patch is trying to prevent. This occurs whenever a lock holder is slow (e.g., filesystem stalls or long synchronous mutation work) but still alive.

Useful? React with 👍 / 👎.

Good catch. I updated the state lock to preserve live owners during stale-lock recovery. The lock owner file now records both pid and a per-acquisition token; stale recovery only removes an old lock if the owner PID is no longer alive, and release() only removes the lock if the token still belongs to the releasing process. That prevents a slow-but-live writer from being displaced and prevents an old owner from deleting a successor’s lock.

I also added a regression test that creates a stale-looking lock owned by the live test process and verifies a competing upsertJob times out without deleting it. Reran node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs and git diff --check.

60ke · 2026-05-23T04:40:02Z

Hitting the same broker-leak symptoms on macOS (Claude Code with codex@openai-codex v1.0.4): app-server-broker.mjs plus node codex app-server and the Rust codex app-server survive past SessionEnd, reparent to launchd (PPID 1), and accumulate across sessions.

Thanks for the macOS details. The latest patch keeps the SessionEnd/stale cleanup behavior aimed at the broker leak, but now validates the persisted pid before signalling: the live command has to match this broker for the same endpoint/pid file. That should still reap the leaked broker process group while avoiding PID-reuse kills of unrelated workloads.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8aea41f757

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-23T04:46:22Z

+  if (options.platform === "win32") {
+    return null;


Do not disable stale broker termination on Windows

getProcessCommand returns null for win32, while teardownBrokerSession now only invokes killProcess when validateProcess(...) passes. In ensureBrokerSession, this means stale or failed broker sessions never terminate their recorded PID on Windows, so orphaned broker processes can survive restarts even though session files are cleaned up.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-23T04:46:22Z

+    if (Date.now() - stats.mtimeMs > STATE_LOCK_STALE_MS && !processExists(owner?.pid)) {
+      fs.rmSync(lockDir, { recursive: true, force: true });


Guard stale-lock recovery against PID reuse

removeStaleLock only checks whether the recorded PID currently exists before deciding a lock is still active. If a crashed lock owner's PID gets reused by an unrelated process, .state.lock is treated as live forever and subsequent upsertJob/updateState calls can repeatedly time out instead of recovering the orphaned lock.

Useful? React with 👍 / 👎.

60ke requested a review from a team May 22, 2026 09:34

chatgpt-codex-connector Bot reviewed May 22, 2026

View reviewed changes

Comment thread plugins/codex/scripts/lib/state.mjs Outdated

60ke force-pushed the codex/fix-broker-state-races branch from 6142e6c to bc4de41 Compare May 22, 2026 09:41

chatgpt-codex-connector Bot reviewed May 22, 2026

View reviewed changes

60ke force-pushed the codex/fix-broker-state-races branch from bc4de41 to df73eb3 Compare May 23, 2026 04:34

chatgpt-codex-connector Bot reviewed May 23, 2026

View reviewed changes

fix(runtime): clean up stale brokers and lock state updates

8aea41f

60ke force-pushed the codex/fix-broker-state-races branch from df73eb3 to 8aea41f Compare May 23, 2026 04:43

chatgpt-codex-connector Bot reviewed May 23, 2026

View reviewed changes

		if (Date.now() - stats.mtimeMs > STATE_LOCK_STALE_MS) {
		fs.rmSync(lockDir, { recursive: true, force: true });

Conversation

60ke commented May 22, 2026

Problem

Changes

Testing

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

60ke commented May 22, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

60ke May 23, 2026

Choose a reason for hiding this comment

Uh oh!

Pelton commented May 22, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

60ke May 23, 2026

Choose a reason for hiding this comment

Uh oh!

60ke commented May 23, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants