Skip to content

fix(runtime): clean up stale brokers and lock state updates#343

Open
60ke wants to merge 1 commit into
openai:mainfrom
60ke:codex/fix-broker-state-races
Open

fix(runtime): clean up stale brokers and lock state updates#343
60ke wants to merge 1 commit into
openai:mainfrom
60ke:codex/fix-broker-state-races

Conversation

@60ke
Copy link
Copy Markdown

@60ke 60ke commented May 22, 2026

Problem

Stale broker sessions could remove their socket and metadata without terminating the recorded broker process when no test-only killProcess override was supplied. State updates also used an unlocked load/mutate/save sequence, so concurrent workers could overwrite each other and drop job records.

Changes

  • Default broker teardown to terminateProcessTree and pass that default through stale-session and failed-start cleanup.
  • Add a small cross-process state lock around the full state transaction so concurrent upsertJob calls preserve each other.
  • Add regression coverage for stale broker termination and concurrent job state writes.

Testing

  • node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs
  • node --test tests/broker-endpoint.test.mjs tests/bump-version.test.mjs tests/commands.test.mjs tests/git.test.mjs tests/process.test.mjs tests/render.test.mjs
  • git diff --cached --check

Note: local npm run build reaches tsc but fails on generated app-server type drift unrelated to this change (requestAttestation / experimentalRawEvents).

@60ke 60ke requested a review from a team May 22, 2026 09:34
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6142e6cd08

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread plugins/codex/scripts/lib/state.mjs Outdated
@60ke 60ke force-pushed the codex/fix-broker-state-races branch from 6142e6c to bc4de41 Compare May 22, 2026 09:41
@60ke
Copy link
Copy Markdown
Author

60ke commented May 22, 2026

Good catch. The timeout needs to be longer than the stale-lock recovery window, otherwise a fresh orphaned lock can make every writer fail before recovery is allowed. I updated the PR so STATE_LOCK_TIMEOUT_MS is derived from STATE_LOCK_STALE_MS + 5000, then reran node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs and git diff --check.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc4de41deb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 181 to 185
killProcess = terminateProcessTree
}) {
if (Number.isFinite(pid)) {
try {
killProcess(pid);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Validate stale-session PID before killing process tree

This change makes teardownBrokerSession kill any finite persisted pid by default, but ensureBrokerSession can reach this path using stale broker.json data when the endpoint is no longer reachable. In that scenario, PID reuse can cause terminateProcessTree to signal an unrelated process (or process group) that now owns the same PID, so simply restarting the companion after a crash can terminate the wrong workload. Please gate termination on broker identity (for example, command/start-time checks) before sending signals.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I updated the broker teardown path so the default signal is now gated on broker identity before calling terminateProcessTree: the pid file must still point at the same pid and the live command line must look like app-server-broker.mjs serve for the same endpoint/pid file. If that check fails, teardown only removes the stale metadata/socket files and leaves the process alone.

I also added regression coverage for both sides: one test verifies a matching stale broker is terminated, and another verifies a reused stale pid running unrelated Node code is not killed. Reran node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs and git diff --check.

@Pelton
Copy link
Copy Markdown

Pelton commented May 22, 2026

Hitting the same broker-leak symptoms on macOS (Claude Code with codex@openai-codex v1.0.4): app-server-broker.mjs plus node codex app-server and the Rust codex app-server survive past SessionEnd, reparent to launchd (PPID 1), and accumulate across sessions.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: df73eb3129

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread plugins/codex/scripts/lib/state.mjs Outdated
Comment on lines +109 to +110
if (Date.now() - stats.mtimeMs > STATE_LOCK_STALE_MS) {
fs.rmSync(lockDir, { recursive: true, force: true });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve active lock owners during stale-lock recovery

removeStaleLock expires .state.lock purely by directory mtime, so any legitimate state transaction that runs longer than 30s can have its live lock deleted by a competing process. After that, both writers can execute updateState concurrently, and the first writer’s release() can also remove the second writer’s lock, reintroducing the lost-update corruption this patch is trying to prevent. This occurs whenever a lock holder is slow (e.g., filesystem stalls or long synchronous mutation work) but still alive.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I updated the state lock to preserve live owners during stale-lock recovery. The lock owner file now records both pid and a per-acquisition token; stale recovery only removes an old lock if the owner PID is no longer alive, and release() only removes the lock if the token still belongs to the releasing process. That prevents a slow-but-live writer from being displaced and prevents an old owner from deleting a successor’s lock.

I also added a regression test that creates a stale-looking lock owned by the live test process and verifies a competing upsertJob times out without deleting it. Reran node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs and git diff --check.

@60ke
Copy link
Copy Markdown
Author

60ke commented May 23, 2026

Hitting the same broker-leak symptoms on macOS (Claude Code with codex@openai-codex v1.0.4): app-server-broker.mjs plus node codex app-server and the Rust codex app-server survive past SessionEnd, reparent to launchd (PPID 1), and accumulate across sessions.

Thanks for the macOS details. The latest patch keeps the SessionEnd/stale cleanup behavior aimed at the broker leak, but now validates the persisted pid before signalling: the live command has to match this broker for the same endpoint/pid file. That should still reap the leaked broker process group while avoiding PID-reuse kills of unrelated workloads.

@60ke 60ke force-pushed the codex/fix-broker-state-races branch from df73eb3 to 8aea41f Compare May 23, 2026 04:43
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8aea41f757

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +123 to +124
if (options.platform === "win32") {
return null;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not disable stale broker termination on Windows

getProcessCommand returns null for win32, while teardownBrokerSession now only invokes killProcess when validateProcess(...) passes. In ensureBrokerSession, this means stale or failed broker sessions never terminate their recorded PID on Windows, so orphaned broker processes can survive restarts even though session files are cleaned up.

Useful? React with 👍 / 👎.

Comment on lines +139 to +140
if (Date.now() - stats.mtimeMs > STATE_LOCK_STALE_MS && !processExists(owner?.pid)) {
fs.rmSync(lockDir, { recursive: true, force: true });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard stale-lock recovery against PID reuse

removeStaleLock only checks whether the recorded PID currently exists before deciding a lock is still active. If a crashed lock owner's PID gets reused by an unrelated process, .state.lock is treated as live forever and subsequent upsertJob/updateState calls can repeatedly time out instead of recovering the orphaned lock.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants