fix(runtime): clean up stale brokers and lock state updates#343
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6142e6cd08
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
6142e6c to
bc4de41
Compare
|
Good catch. The timeout needs to be longer than the stale-lock recovery window, otherwise a fresh orphaned lock can make every writer fail before recovery is allowed. I updated the PR so |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc4de41deb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| killProcess = terminateProcessTree | ||
| }) { | ||
| if (Number.isFinite(pid)) { | ||
| try { | ||
| killProcess(pid); |
There was a problem hiding this comment.
Validate stale-session PID before killing process tree
This change makes teardownBrokerSession kill any finite persisted pid by default, but ensureBrokerSession can reach this path using stale broker.json data when the endpoint is no longer reachable. In that scenario, PID reuse can cause terminateProcessTree to signal an unrelated process (or process group) that now owns the same PID, so simply restarting the companion after a crash can terminate the wrong workload. Please gate termination on broker identity (for example, command/start-time checks) before sending signals.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch. I updated the broker teardown path so the default signal is now gated on broker identity before calling terminateProcessTree: the pid file must still point at the same pid and the live command line must look like app-server-broker.mjs serve for the same endpoint/pid file. If that check fails, teardown only removes the stale metadata/socket files and leaves the process alone.
I also added regression coverage for both sides: one test verifies a matching stale broker is terminated, and another verifies a reused stale pid running unrelated Node code is not killed. Reran node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs and git diff --check.
|
Hitting the same broker-leak symptoms on macOS (Claude Code with |
bc4de41 to
df73eb3
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: df73eb3129
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (Date.now() - stats.mtimeMs > STATE_LOCK_STALE_MS) { | ||
| fs.rmSync(lockDir, { recursive: true, force: true }); |
There was a problem hiding this comment.
Preserve active lock owners during stale-lock recovery
removeStaleLock expires .state.lock purely by directory mtime, so any legitimate state transaction that runs longer than 30s can have its live lock deleted by a competing process. After that, both writers can execute updateState concurrently, and the first writer’s release() can also remove the second writer’s lock, reintroducing the lost-update corruption this patch is trying to prevent. This occurs whenever a lock holder is slow (e.g., filesystem stalls or long synchronous mutation work) but still alive.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch. I updated the state lock to preserve live owners during stale-lock recovery. The lock owner file now records both pid and a per-acquisition token; stale recovery only removes an old lock if the owner PID is no longer alive, and release() only removes the lock if the token still belongs to the releasing process. That prevents a slow-but-live writer from being displaced and prevents an old owner from deleting a successor’s lock.
I also added a regression test that creates a stale-looking lock owned by the live test process and verifies a competing upsertJob times out without deleting it. Reran node --test tests/broker-lifecycle.test.mjs tests/state.test.mjs and git diff --check.
Thanks for the macOS details. The latest patch keeps the SessionEnd/stale cleanup behavior aimed at the broker leak, but now validates the persisted pid before signalling: the live command has to match this broker for the same endpoint/pid file. That should still reap the leaked broker process group while avoiding PID-reuse kills of unrelated workloads. |
df73eb3 to
8aea41f
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8aea41f757
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (options.platform === "win32") { | ||
| return null; |
There was a problem hiding this comment.
Do not disable stale broker termination on Windows
getProcessCommand returns null for win32, while teardownBrokerSession now only invokes killProcess when validateProcess(...) passes. In ensureBrokerSession, this means stale or failed broker sessions never terminate their recorded PID on Windows, so orphaned broker processes can survive restarts even though session files are cleaned up.
Useful? React with 👍 / 👎.
| if (Date.now() - stats.mtimeMs > STATE_LOCK_STALE_MS && !processExists(owner?.pid)) { | ||
| fs.rmSync(lockDir, { recursive: true, force: true }); |
There was a problem hiding this comment.
Guard stale-lock recovery against PID reuse
removeStaleLock only checks whether the recorded PID currently exists before deciding a lock is still active. If a crashed lock owner's PID gets reused by an unrelated process, .state.lock is treated as live forever and subsequent upsertJob/updateState calls can repeatedly time out instead of recovering the orphaned lock.
Useful? React with 👍 / 👎.
Problem
Stale broker sessions could remove their socket and metadata without terminating the recorded broker process when no test-only
killProcessoverride was supplied. State updates also used an unlocked load/mutate/save sequence, so concurrent workers could overwrite each other and drop job records.Changes
terminateProcessTreeand pass that default through stale-session and failed-start cleanup.upsertJobcalls preserve each other.Testing
node --test tests/broker-lifecycle.test.mjs tests/state.test.mjsnode --test tests/broker-endpoint.test.mjs tests/bump-version.test.mjs tests/commands.test.mjs tests/git.test.mjs tests/process.test.mjs tests/render.test.mjsgit diff --cached --checkNote: local
npm run buildreachestscbut fails on generated app-server type drift unrelated to this change (requestAttestation/experimentalRawEvents).